Major Department: Electrical Engineering

The ultimate goal of this research was to develop a
versatile articulatory synthesizer for perception studies,
combining high-quality speech with low computation time.

A new  implementation  of  the  articulatory  model  was 
developed  that  provides  an  interactive  graphic  editor  using 
computer  aided  design  techniques.  A digital  time-domain 
approach  was  used  for  the  acoustic  model.  Two  new  glottal 
excitation  models  were  proposed:  a parametric  2-mass  model  and 
a glottal  area  model.  The  tracts  and  the  radiation  were 
simulated  by  an  equivalent  resistive  network.  The  sinuses  and 
the  turbulent  noise  sources  were  included  in  the  synthesizer 
to  simulate  consonants.  The  various  parameters  of  the 
synthesizer  can  be  modified  for  a variety  of  experiments, 
including  the  synthesis  of  various  voice  types  and  vocal 
disorders.  Some  new  parameters,  not  available  in  other 
synthesizers, were designed into our system.

A new  optimization  scheme  was  conceived  for  deriving 
vocal  tract  area  functions  from  speech,  for  American-English 
phonemes.  The  scheme  uses  both  gradient  search  and  linear 
successive  approximation,  providing  fast  convergence,  errors 
less  than  2%,  and  natural  articulatory  dynamics,  while 
circumventing  local  minima  traps.  The  objective  function  is 
the  least-absolute-value  error  between  the  model-derived  and 
the  speech-derived  first  three  formants.  The  gradient  search 
method  was  improved  with  respect  to  computational  time  by 
using  an  algorithm  inspired  by  the  Fletcher-Reeves  Method. 
Proper  articulatory  dynamics  were  achieved  by  considering  the 
vocal  tract  losses  in  the  area-to-formant  transformation,  by 
establishing  appropriate  initial  configurations,  by  properly 
selecting  the  parameters  for  the  optimization  procedure,  by 
imposing  constraints  on  the  relative  placement  of  the 
articulators,  and  by  using  flexible  on-line  pictorial  aids. 

All  of  these  factors  were  found  to  result  in  an 
articulatory  synthesizer  design  that  produced  natural-sounding 
synthetic  speech  with  small  computational  time. 

CHAPTER 1

ARTICULATORY SYNTHESIZERS

1.1 Introduction

The human body is probably the most perfect machine. For
many years scientists and researchers have realized that
aspects of engineering can be advanced by imitating the
natural mechanisms not only of the human body but also of
other animals. Illustrative examples are the "psychophysical
effects of vision," which provided an improved image
processing technique (Kunt et al., 1985), and "neural
networks," which many consider a revolutionary approach for
solving complex pattern-recognition problems. In a like
manner, speech researchers have tried to model the mechanisms
of generation and propagation of sound waves in the human
vocal system.

Although formant and linear predictive coding (LPC)
synthesizers have been considerably improved in recent years
(Atal and Caspers, 1983; Schroeder and Atal, 1985; Childers et
al., 1985b; Klatt, 1987; Pinto et al., 1989; Childers and Wu,
1990), the articulatory synthesizer has a greater potential to
deal with some issues that are essential for producing
high-quality and natural-sounding speech at low bit rates:
source-tract interaction (Koizumi et al., 1985), nasals (Maeda,
1982b), parameter interpolation (Sondhi and Schroeter, 1987),
reproduction of transitions between phonemes (Coker, 1967),
etc.

The  first  sections  of  this  chapter  present  an  overview  of 
articulatory  synthesizers.  Section  1.5  establishes  the  goals 
of this research, and the last section describes the other
chapters.

1.2 History of Articulatory Synthesizers

History records von Kempelen (Flanagan, 1972b) as the
pioneer who attempted to reproduce the human voice (in 1791,
Vienna). His device consisted of a bellows, a reed and a
leather tube whose shape was controlled by hand. Dudley
(1939), however, can be considered the "Father of Electrical
Speech Synthesizers" by virtue of his remarkable "VODER," the
first device capable of coding voice signals. The VODER
(voice operation demonstrator) consisted of a bank of bandpass
electronic filters controlled by a keyboard and driven either
by a relaxation oscillator (voiced source) or by a noise
source, depending on the position of a wrist bar. A foot
pedal determined the fundamental frequency of the voiced
sounds.

In 1950, another important step in the area of speech
production and perception was made with the "Pattern
Playback," an optical-electrical synthesizer designed at
Haskins Laboratories by Cooper et al. (1951). This device was
the first to convert the spectrogram back into sound. The
same year, 1950, saw the machine that can be considered
the first real electrical articulatory synthesizer: the
"static simulators" designed by Dunn (1950), the pioneer in
modelling the vocal tract as an electrical transmission line.

A number  of  scientists  and  investigators  have  contributed 
to  improving  articulatory  synthesizers.  Some  of  them  and 
their  major  contributions  are: 

(1) Fant: acoustic theory of speech production (1960),
glottal pulse models (three-parameter and four-parameter
models: 1979, 1982, 1985), source-tract interaction and
articulatory synthesizer implementation.

(2) Kelly and Lochbaum: model of traveling waves along
concatenated tubes (1962).

(3) Coker and Fujimura: articulatory model (1966),
nasals.

(4) Flanagan et al.: two-mass model for the vocal folds,
vocal tract model as a non-uniform transmission line
accounting for the losses (viscous friction, heat conduction
and yielding walls) and for the source-tract interaction,
articulatory synthesizer realization, and many other important
contributions (Flanagan and Landgraf, 1968; Flanagan, 1972a,
1972b; Flanagan et al., 1975, 1980).

(5) Morse and Ingard: model of the lip radiation
impedance (1968).


(6) Sondhi and Gopinath: acoustic-to-articulatory
mapping using impulse response at the lips (1971).

(7) Portnoff: equations of sound waves in a lossless
tube (1973).

(8) Titze: vocal fold modeling (1973, 1989).

(9) Mermelstein: improved articulatory model (1973),
acoustic-to-articulatory mapping, articulatory synthesizer
realization.

(10) Wakita: acoustic-to-articulatory mapping using LPC
analysis (1973).

(11) Atal et al.: acoustic-to-articulatory mapping by
computer-sorting technique (1978).

(12) Maeda: reconsideration of the Kelly-Lochbaum model
(1977), articulatory model of the tongue (1979), nasals
synthesis (1982a), digital simulation of the vocal tract
system (1982b).

(13) Strube: time-varying wave digital filter model
(1982).

(14) Rubin et al.: modifications in the model of
traveling waves, including the effects of the termination at
the glottis, lips and nostrils, use of digital filters and
reflection coefficients to generate speech (1981).

(15) Childers et al.: articulatory synthesizer
realization (1983), use of electroglottography (1984), vocal
fold modeling (1986).

(16) Allen and Strong (1985), Sondhi and Schroeter
(1987): hybrid articulatory synthesizers.

(17) Koizumi et al.: modified two-mass models for the
vocal folds (1987).

(18) Klatt et al.: many relevant contributions toward
synthesis-by-rule systems (1987).

This  list  is  far  from  exhaustive.  Many  other  researchers 
have  indirectly  contributed  to  the  development  of  articulatory 
synthesizers  owing  to  their  achievements  in  the  areas  of 
mathematics,  anatomy,  physiology  and  phonetics. 

1.3 The Human Vocal System and Speech Production

1.3.1  Description  of  the  Vocal  System 

The  schematic  diagram  of  the  human  vocal  system  and  its 
simplified  acoustic  model  are  shown  in  Fig.  1.1  and  Fig.  1.2, 
respectively  (Flanagan,  1972a;  Flanagan  et  al.,  1970).  The 
subglottal  system,  composed  of  the  lungs,  bronchi,  and 
trachea,  generates  the  air  flow.  The  rib  cage  contracts  and 
the  diaphragm  moves  upward  causing  an  air-pressure  increase  in 
the  lungs,  followed  by  a release  of  the  air  through  the 
trachea.  The  air  flow,  with  a volume  velocity  Ug,  passes 
through  the  glottis,  an  opening  between  the  vocal  folds,  in 
the  larynx.  For  voiced  or  mixed  excitation  the  vocal  folds 
and  the  lungs  make  up  a relaxation  oscillator,  that  is,  the 
vocal  folds  successively  open,  due  to  the  subglottal  pressure, 
and close, because of the Bernoulli effect. The vocal fold
vibrations produce air pulses, represented by a
volume-velocity waveform, which excite the vocal tract. The
fundamental frequency of this waveform depends on the
compliance, mass, length and elasticity of the vocal folds.
For the "unvoiced" sounds the vocal folds do not vibrate; the
air is forced through a constriction in the vocal tract, with
a sufficiently high Reynolds number, producing turbulence.

Figure 1.1 Human vocal mechanism (Flanagan, 1972a).

Figure 1.2 Model of the human vocal system
(Flanagan et al., 1970).

The  power  spectrum  of  voiced  excitation  has,  typically, 
an  average  slope  of  -12  dB/octave,  whereas  the  unvoiced 
excitation  has  an  almost  flat  spectrum. 

The  vocal  tract  is  a lossy,  nonuniform  and  "slowly  time- 
varying  acoustic  tube"  that  extends  from  the  glottis  to  the 
lips.  Its  cross-sectional  area  depends  primarily  on  the 
placement  of  the  tongue  and  also  on  the  positions  of  the  other 
articulators:  lips,  jaw,  hyoid  and  velum.  The  constriction 
(smallest  cross-sectional  area)  is  caused  by  the  tongue  body 
or  by  the  tongue  tip,  for  vowels  and  consonants,  respectively. 
Figure  1.3  shows  the  midsagittal  plane  of  the  vocal  system. 
Formants, the resonance frequencies of the tube (peaks in the
spectrum), vary according to the area function, the
cross-sectional area along the vocal tract. The formant
frequencies of the phonemes are associated with the resonances
of the back and front cavities (with respect to the location of
the constriction). For nasals, the nasal tract is coupled to the
vocal tract, affecting the formant values.
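
As a rough illustration of how formants follow from the tube geometry, the sketch below uses the textbook idealization of a uniform, lossless tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/4L. This is not the lossy, nonuniform tract described above; the function name, the 17.5 cm length and the sound speed of 35,000 cm/s are assumptions chosen for illustration.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_per_s=35000.0):
    """Resonances (Hz) of an idealized uniform lossless tube.

    Assumes a tube closed at the glottis and open at the lips, so the
    n-th resonance is (2n - 1) * c / (4 * L). Illustrative only.
    """
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


# Assumed tract length of 17.5 cm (hypothetical value).
formants = uniform_tube_formants(17.5)
print(formants)
```

Changing the area function away from the uniform case shifts these resonances, which is the effect the area-to-formant transformation has to capture.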

Figure 1.3 Midsagittal plane of vocal system
(Jassen and Nolan, 1984).

Figure 1.4 Vocal tract configurations for the vowels IY and
UW and corresponding spectra (Jassen and Nolan, 1984).

Figure 1.4 sketches the configurations for two vowels and
their corresponding spectra. The process is dynamic: as the
vocal tract shape varies in time, each distinctive shape yields
a distinctive formant pattern, and the sequence of patterns
produces the concatenation of specific phonemes. The dynamic
shaping of the vocal tract during the generation of speech is
called articulation. The transitional movements of the
articulators, along with the excitation, are the generating
mechanism for the phonemes.

The acoustic signal for a given phoneme may vary,
depending on the nature of its neighboring phonemes (Kent and
Minifie, 1977). A speech segment can affect the utterance of
another segment "placed" either to its left (anticipatory
coarticulation), or to its right (carry-over coarticulation).

The nasal tract extends from the velum to the nostrils
(Fig. 1.2). To produce nasal sounds, the velum is lowered and
the  nasal  tract  becomes  connected  to  the  vocal  tract  through 
the  velopharyngeal  port.  The  lips  and  nostrils  are  the 
boundaries  for  the  radiation  of  sounds. 

1.3.2  Excitation  Mechanisms  and  Place  of  Articulation 

Flanagan  et  al.  (1970)  identified  three  major  excitation 
mechanisms  for  producing  sounds: 

(1) The vocal folds vibrate due to Bernoulli's law
(interaction between subglottal pressure and the lower
Bernoulli pressure in the constriction formed by the glottis),
producing the quasi-periodic pattern of voiced sounds, e.g.,
vowels, diphthongs and semivowels.

(2) Turbulent flow is created when the air passes
through a constriction (with a Reynolds number greater than
the critical Reynolds number of the constriction), generating
unvoiced sounds, e.g., unvoiced fricatives FF, TH, SS and SH.

(3) The pressure buildup behind a constriction in the
vocal tract is suddenly released, producing plosive sounds
such as the voiced stops BB, DD and GG as well as the unvoiced
stops PP, TT and KK.
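
The turbulence condition in mechanism (2) can be sketched as a Reynolds-number test at the constriction. The sketch below is a minimal illustration, not the criterion used in this work: the function names, the use of an equivalent circular diameter as the characteristic dimension, the kinematic viscosity of air (about 0.15 cm^2/s) and the critical value of 1800 are all assumptions.

```python
import math

AIR_KINEMATIC_VISCOSITY = 0.15  # cm^2/s, approximate value for air (assumed)
CRITICAL_REYNOLDS = 1800.0      # assumed illustrative threshold


def reynolds_number(volume_velocity, area):
    """Re = v * d / nu for a constriction.

    volume_velocity: flow through the constriction (cm^3/s)
    area: constriction cross-sectional area (cm^2)
    The particle velocity is v = U / A, and d is taken as the diameter
    of a circle with the same area (an assumption of this sketch).
    """
    velocity = volume_velocity / area
    diameter = math.sqrt(4.0 * area / math.pi)
    return velocity * diameter / AIR_KINEMATIC_VISCOSITY


def is_turbulent(volume_velocity, area, critical=CRITICAL_REYNOLDS):
    """True when the flow exceeds the (assumed) critical Reynolds number."""
    return reynolds_number(volume_velocity, area) > critical


# Same flow through a narrow and a wide section (hypothetical values).
narrow = is_turbulent(200.0, 0.1)
wide = is_turbulent(200.0, 5.0)
print(narrow, wide)
```

For a fixed flow, narrowing the constriction raises the particle velocity faster than it shrinks the diameter, so the Reynolds number grows and the noise source switches on only at tight constrictions.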

These mechanisms may be combined to produce some
phonemes:

(a) Mixed excitation sounds, e.g., the voiced fricatives
VV, DH, ZZ and ZH, are produced using the combined mechanisms
(1) and (2).

(b) Voiced stops are produced by combining mechanisms
(1) and (3).

(c) Affricates JJ and CH are produced by combining
mechanisms (3) and (2).

Other  phenomena  for  producing  sounds  are: 

(a) The whisper phoneme HH, which is produced by a
turbulent flow originating at the glottis, without vibration
of the vocal cords.

(b) In unvoiced stop production, after the release
of the pressure, there is a short period of friction, followed
by a period of steady air flow through the glottis, called
"aspiration" (Rabiner and Schafer, 1978).

(c) Antiresonances (zeros in the transfer function)
occur in the vocal tract during the generation of nasals and
nasalized vowels, by virtue of the interconnection of the
nasal and oral cavities, and during the production of unvoiced
fricatives because of the coupling between the back and the
front cavities.

Different phonemes with the same mechanism of excitation
are generated by different placements of the
articulators (different configurations of the vocal tract).

The vowels are classified as "front" (IY, IH, EH, and
AE), "middle" (AA, ER, AH, and AO) or "back" (UW, UH, and O),
in accordance with the position of the tongue hump. Figure
1.5 shows the vowel quadrilateral, a representation of vowels
that considers not only the tongue position but also the degree
of opening of the vocal tract. The vowels at the extremes of
the quadrilateral are known as the "cardinal vowels."

For  the  diphthongs,  the  vocal  tract  successively  takes  on 
the  shape  of  the  constituent  vowels. 

For the semivowels, the articulators start from an
initial position, and move to the configuration of the vowel
that follows the specific semivowel. For the glide YY the
vocal tract starts from the configuration of an IH (closed);
then, the tract opens and the tongue moves to the next
position. The glide (or retroflex) RR starts with the tongue
tip near the roof of the mouth. A large cavity is formed
under the tongue. The liquid (or lateral) LL starts with the
tongue tip near the alveolar ridge and moves more rapidly than
RR does to the next position. The liquid WW starts from the
configuration of a UH, with the lips rounded, and then
moves quickly to the new position.

For  the  bilabial  stops  BB  and  PP  the  constriction  is 
produced  by  the  lips,  for  the  apico-alveolar  stops  DD  and  TT, 
by  the  tongue  tip  against  the  alveolar  ridge,  and  for  the 
velar  stops  GG  and  KK,  by  the  tongue  back  against  the  velum. 
Figure  1.6  shows  these  configurations. 

The nasals are produced when the velum is lowered
(coupling the nasal tract to the pharynx) and the vocal tract
is closed at some point. The nasal MM is produced with
the closure of the lips, NN with the tongue at the roof of the
mouth, and NG with the closure at the velum.

The fricatives VV and FF are generated by placing the
upper incisors against the lower lip (labio-dental); DH and TH
with the tongue tip against the incisors (apico-dental); SS
and ZZ with the tongue tip near the tooth ridge and the rows
of teeth together; and finally, SH and ZH with the tongue back
near the hard palate and with rounded lips (palatal).

For affricates JJ and CH, the articulatory movements are
approximately those of the constituent stop and fricative.

Table 1.1 summarizes some articulatory features of the
phonemes.

Figure 1.5 The vowel quadrilateral, with FRONT, MIDDLE and
BACK regions (Jassen and Nolan, 1984).

Figure 1.6 Vocal tract configurations for stops
(Jassen and Nolan, 1984).

TABLE 1.1: ARTICULATORY FEATURES OF THE PHONEMES

Continuity     Class        Voiced/   Constriction                 Phoneme /Example
                            Unvoiced
Continuant     Vowels       V         front                        IH /bit, IY /beet, EH /bet, AE /bat
                                      middle                       AA /hot, ER /bird, AH /but, AO /bought
                                      back                         UW /boot, UH /foot, O /rod
               Fricatives   V         lower lip/upper incisors     VV /vet
                                      tongue tip/incisors          DH /this
                                      tongue tip/tooth ridge       ZZ /zero
                                      tongue back/hard palate      ZH /azure
                            U         lower lip/upper incisors     FF /fed
                                      tongue tip/incisors          TH /thin
                                      tongue tip/tooth ridge       SS /set
                                      tongue back/hard palate      SH /shed
               Nasals       V         lips                         MM /met
                                      roof of mouth                NN /net
                                      velum                        NG /song
Noncontinuant  Diphthongs   V                                      AY /buy, OY /boy, AW /how,
                                                                   EY /bay, OW /boat, YU /you
               Liquid       V                                      WW /wet, LL /let
               semivowels
               Glide        V                                      RR /red, YY /yet
               semivowels
               Stops        V         lips                         BB /bet
                                      alveolar ridge/tongue tip    DD /debt
                                      tongue back                  GG /get
                            U         lips                         PP /pet
                                      alveolar ridge/tongue tip    TT /tap
                                      tongue back                  KK /kit
               Affricates   V         D => ZH                      JJ /June
                            U         T => SH                      CH /chat

Phonetic  segments  use  Klatt  symbols  (Jonathan  et 

1.3.3 A Basic Articulatory Synthesizer

The rudiments of articulatory synthesizers can be
understood now. At any instant, the placement of articulators
determines the shape of the vocal tract (cross-sectional area
along the vocal tract or area function). Each specific shape
determines the production of one continuant phoneme (vowel,
fricative or nasal), while a "changing shape or configuration"
occurs during the generation of one noncontinuant phoneme
(semivowel, diphthong, stop or affricate). Therefore, a
basic articulatory synthesizer can be modeled as shown in Fig.
1.7.

Articulatory  models  can  be  classified  in  two  major  types: 
"midsagittal-distance  articulatory  models"  and  "area 
articulatory  models."  The  "midsagittal-distance  articulatory 
model"  input  is  the  position  of  the  articulators  (tongue  body, 
tongue  tip,  jaw,  lips,  hyoid  and  velum)  with  respect  to  a 
midsagittal  plane  and  its  output  is  an  estimation  of  the  area 
function  that  feeds  the  "slowly  time-varying  filter" 
represented  by  the  vocal  tract.  The  performance  of  these 
models  can  be  assessed  by  comparing  their  output  midsagittal 
distances  with  those  given  by  x-ray  pictures.  Figure  1.8 
depicts  the  seven-parameter  articulatory  model  used  by 
Flanagan et al. (1970). The model of Levinson and Schmidt
(1983), shown in Fig. 1.9, uses six parameters: tongue body
height a1, anterior/posterior position of the tongue body a2,
tongue tip height a3, mouth opening a4, pharyngeal opening a5,
and vocal tract length a6. It was based on Coker's
articulatory model (1976). The model devised by Mermelstein
(1973) is depicted in Fig. 1.10. For this model, the movable
structures are the jaw, lips, velum, hyoid, tongue body and
tongue blade. The positions of the lips and tongue body are
relative to the placement of the jaw. The tongue-tip location
is relative to the tongue body coordinates. The area function
is derived from the midsagittal dimensions of the vocal tract,
using empirical formulas (Heinz and Stevens, 1965; Ladefoged
et al., 1971). Figure 1.11 shows the grid that is
superimposed on the vocal tract plane in order to measure the
midsagittal distances.

Figure 1.7 A basic articulatory synthesizer.

Figure 1.8 Seven-parameter articulatory model
of Flanagan et al. (1970).

Figure 1.9 Six-parameter articulatory model
of Levinson and Schmidt (1983): tongue body height,
anterior/posterior position of tongue body, tongue tip height,
mouth opening, pharyngeal opening, vocal tract length.

The "area articulatory models" deal directly with the
cross-sectional area on both sides of the constriction,
instead of parameterizing the sagittal distances. The first
known simulations of this kind are the three-parameter model
due to Stevens and House (1955) and the three-parameter model
by Fant (1960). The input parameters are the position of the
tongue-body constriction, its minimum area, and other
parameters related to the lips or to the tract length.
Figure 1.12 shows the model by Atal et al. (1978), and Fig.
1.13 depicts Ishizaka's model (Flanagan et al., 1980).

Other articulatory models available in the literature
were introduced by Maeda (1979), Heike (1979), Rubin et al.
(1981), Kubin and Pikturna (1987), etc.

Figure 1.10 Mermelstein's articulatory model
(Mermelstein, 1973). Labeled parameters: tongue body, tongue
blade, jaw, lips, hyoid and velum.

Figure 1.11 Grid system to measure sagittal distances
(Mermelstein, 1973).

Figure 1.12 Area articulatory model of Atal et al. (1978).
The area function $S(x)$ (cm$^2$) is

$$
S(x) =
\begin{cases}
2, & x < 2 \\
4.5 - (4.5 - A_c)\cos[0.3\,(x - X_c)], & 2 < x < l - l_m \\
A_m, & l - l_m < x < l
\end{cases}
$$

where $X_c$ is the distance of the place of maximum
constriction from the glottis (cm), $A_c$ is the
cross-sectional area at the place of maximum constriction
(cm$^2$), $A_m$ is the area of the mouth opening (cm$^2$),
$l$ is the vocal-tract length, and $l_m = (l - 14)^{1/2}$.


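Figure 1.13 Area articulatory model of Flanagan et al. (1980).
The area function is

$$
A(x) =
\begin{cases}
\dfrac{A_b + A_c}{2} - \dfrac{A_b - A_c}{2}
  \cos\left[\dfrac{\pi (x_c - x)}{l_b}\right], & x < x_c \\[2ex]
\dfrac{A_f + A_c}{2} - \dfrac{A_f - A_c}{2}\,K, & x > x_c
\end{cases}
$$

where

$$
K = \cos\left\{\pi\left[0.4 + 0.6\,\frac{x - x_c}{l_f}\right]
    \frac{x - x_c}{l_f}\right\},
\qquad l_b = 8L/17, \qquad l_f = 7L/17,
$$

with $A_c > 0$ and $L/10 < x_c < 9L/10$ (the figure's remaining
bounds, involving $A_c$, $A_f$ and the values 13 and 21, are
illegible in the source).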
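
To make the area model of Fig. 1.13 concrete, the sketch below evaluates its two cosine branches on either side of the constriction. It is a minimal reading of the figure's formulas, not the author's implementation: the function name, the default tract length L = 17 cm and the example areas are assumptions, and the behavior farther than l_b behind or l_f ahead of the constriction is not specified here.

```python
import math


def area_function(x, xc, Ac, Ab, Af, L=17.0):
    """Cross-sectional area at distance x from the glottis (Fig. 1.13).

    xc: constriction location, Ac: constriction area,
    Ab, Af: back- and front-cavity areas, L: tract length (assumed 17 cm).
    """
    lb = 8.0 * L / 17.0  # back-cavity length scale
    lf = 7.0 * L / 17.0  # front-cavity length scale
    if x <= xc:
        # Back of the constriction: raised-cosine blend from Ab down to Ac.
        return (Ab + Ac) / 2.0 - (Ab - Ac) / 2.0 * math.cos(
            math.pi * (xc - x) / lb)
    # Front of the constriction: warped cosine K blends Ac up to Af.
    u = (x - xc) / lf
    K = math.cos(math.pi * (0.4 + 0.6 * u) * u)
    return (Af + Ac) / 2.0 - (Af - Ac) / 2.0 * K


# Hypothetical configuration: constriction of 0.5 cm^2 at 9 cm.
at_constriction = area_function(9.0, 9.0, 0.5, 8.0, 4.0)
at_back = area_function(1.0, 9.0, 0.5, 8.0, 4.0)
at_front = area_function(16.0, 9.0, 0.5, 8.0, 4.0)
print(at_constriction, at_back, at_front)
```

The three sample points check the boundary behavior of the formulas: the area equals Ac at the constriction, Ab at a distance l_b behind it, and Af at a distance l_f ahead of it.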

The vocal tract filter block in Fig. 1.7 is modeled as a
concatenation of lossy transmission-line elements (Flanagan,
1970), which, in turn, can be represented by lumped electronic
components whose values depend on the area function.

The  glottal  source  block  represents  the  glottis  and 
vocal  folds  during  the  voiced-sound  excitation  to  the  filter. 
The  radiation  block  represents  the  effect  of  the  lips  and 
nostrils.  The  excitation  for  the  sounds  characterized  by 
turbulence  (fricatives,  stops,  etc.)  is  modeled  by  a noise 
source  inserted  usually  near  the  constriction  point.  For 
aspirated  sounds  the  noise  source  is  placed  at  the  glottis. 
The  dynamics  of  the  system  are  dictated  by  the  characteristics 
of  the  phonemes.  Each  phoneme  is  characterized  by  its 
duration,  its  place  and  its  manner  of  articulation.  Each 
phoneme  is  represented  by  a configuration  stored  in  a database 
as  a "target"  for  the  synthesizer.  The  data  between  targets 
is  estimated  by  interpolation  (Flanagan  et  al.,  1975;  Mrayati 
et al., 1988). This target-based articulatory synthesizer is
supported by the work of MacNeilage (1970) and Gay (1977).
For some VCV (vowel-consonant-vowel) combinations, Ohman's
diphthongal hypothesis of anticipatory coarticulation (Ohman,
1967) may apply.
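
The target-and-interpolation idea can be sketched as follows. The text does not say which interpolation rule is used, so linear interpolation is an assumption here, as are the function name and the example target vectors (which stand in for stored articulatory or area-function configurations).

```python
def interpolate_targets(start, end, n_frames):
    """Linearly interpolate between two target configurations.

    start, end: equal-length lists of parameters (e.g. areas in cm^2).
    n_frames: number of frames returned, including both targets.
    Linear interpolation is an illustrative assumption.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first target, 1.0 at the second
        frames.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return frames


# Two hypothetical two-parameter targets and five frames between them.
frames = interpolate_targets([0.0, 2.0], [4.0, 6.0], 5)
print(frames)
```

A smoother rule (for example a cosine or spline transition) could be substituted without changing the interface; the phoneme duration would set `n_frames`.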

In general, three main approaches have been used in
articulatory synthesizers. The time-domain method consists of
solving, in each sampling interval, the set of stiff
differential equations (Gear, 1971) that describe the
electronic network model for the entire system (Flanagan et
al., 1975, 1980; Maeda, 1982a). Computer-aided analysis of
electronic circuits is employed to simplify the computation.
Natural-sounding speech can be achieved but the computational
burden is high (Bocchieri, 1983).

The wave digital filter method links digital filtering to
the model of forward and backward traveling waves in a
lossless acoustic tube (Kelly and Lochbaum, 1962; Titze, 1973;
Maeda, 1977; Rubin et al., 1981; Strube, 1982). The great
advantage of this method comes from the use of digital
filtering techniques, which can provide a fast realization,
even real-time synthesis (Meyer et al., 1989). The problems
are related to an oversimplified model of the glottis, to
neglecting the source-tract interaction, and to ignoring
losses in the tracts. Nevertheless, some progress has been
made using this method (Kabasawa et al., 1983; Meyer et al.,
1989).
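
The traveling-wave picture can be illustrated with the standard lossless junction between two tube sections. The sketch below is a generic textbook form for pressure waves, not the specific filter structure of any cited work; the function names and the example areas are assumptions.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction of a concatenated-tube model.

    For pressure waves, r_k = (A_k - A_{k+1}) / (A_k + A_{k+1}),
    where A_k is the area of the section nearer the glottis.
    """
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def scatter(forward_in, backward_in, r):
    """One lossless scattering junction for pressure waves.

    forward_in: wave arriving from the glottis side.
    backward_in: wave arriving from the lip side.
    Returns (forward_out, backward_out).
    """
    forward_out = (1.0 + r) * forward_in - r * backward_in
    backward_out = r * forward_in + (1.0 - r) * backward_in
    return forward_out, backward_out


# Hypothetical three-section tract: no change, then a narrowing.
coeffs = reflection_coefficients([4.0, 4.0, 1.0])
through = scatter(1.0, 0.0, coeffs[0])
narrowing = scatter(1.0, 0.0, coeffs[1])
print(coeffs, through, narrowing)
```

With equal areas the junction is transparent, while a narrowing reflects part of the forward wave back toward the glottis. Losses, the glottal termination and radiation are deliberately left out, which mirrors the limitations noted above.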

The third method, known as the hybrid time-frequency
domain technique (Allen and Strong, 1985; Sondhi and
Schroeter, 1987), models the nonlinear characteristics of the
glottis in the time domain and models the vocal and nasal
tract in the frequency domain. The two domains are interfaced
by inverse Fourier transform and digital convolution
techniques (Sondhi and Schroeter, 1987). Other
inverse-transform algorithms can also be used to obtain the
time-domain representation from the frequency-domain vocal
tract model (Childers et al., 1983; Koizumi et al., 1985). This
third method, although fast, is not capable of reproducing the
dynamic transitions of certain phonemes (plosives, for
example) with an acceptable quality.

Another option is to model the complete system in the
frequency domain, considering only a stationary glottal
impedance (Lin, 1990).

1.4 Models for Articulatory Synthesizers

1.4.1  Source  or  Excitation  Models 

The  glottal  source  block  in  Fig.  1.7  represents  the 
subglottal  system  and  the  vocal  folds  and  generates  the 
voicing  excitation,  i.e.,  the  volume-velocity  waveform. 

The  excitation  models  can  be  grouped  into  two  main 
classes:  mechanical  models  and  parametric  models. 

1.4.1.1 Mechanical models

These  models  represent  the  glottal  source  as  lumped 
mechanical  oscillators  or  as  distributed  mechanical  systems. 

For the lumped or discrete mechanical models, the
subglottal system is represented by an air reservoir with
pressure Ps that provides an air flow with the volume velocity
Ug, as shown in Fig. 1.2. The vocal folds are mechanically
modeled by an oscillatory system of masses, viscous damping,
and springs. Flanagan and Landgraf (1968) conceived the
one-mass model and later Ishizaka and Flanagan (1972) improved
it, introducing the two-mass model of the vocal folds. Titze
(1973) devised a 16-mass model (three-dimensional) and Koizumi
et al. (1987) proposed some modifications for the two-mass
model in order to consider the effects of the mucosal surface
wave and the vertical phasing in the vocal fold vibration.
The models are shown in Fig. 1.14 (one-mass), Fig. 1.15
(two-mass), and Fig. 1.16 (sixteen-mass). Some characteristics
of these discrete models are listed in Table 1.2.

Distributed mechanical systems (continuum models)
represent the vocal folds as a continuous deformable medium
(Titze and Strong, 1975; Titze and Talkin, 1979). The model
of Titze and Strong is shown in Fig. 1.17.

There is an important tradeoff to be considered: the
higher the sophistication of the physical model to account for
all the mechanisms of the vocal folds, the higher the
computational burden. Besides, a model capable of simulating
all the movements of the tissues and all boundary conditions
would be extremely complex and difficult to build.


Figure 1.14 One-mass mechanical model of the vocal folds
(Flanagan and Landgraf, 1968).

Figure 1.15 Two-mass mechanical model of the vocal folds
(Ishizaka and Flanagan, 1972).

Figure 1.16 Sixteen-mass mechanical model of the vocal
folds (Titze, 1973).

TABLE 1.2: CHARACTERISTICS OF THE SELF-OSCILLATING
MODELS OF THE VOCAL FOLDS

ONE-MASS MODEL
* SIMPLE
* LOW COMPUTATIONAL BURDEN
* EXCESSIVE SOURCE-TRACT INTERACTION
* PHASE DIFFERENCE BETWEEN THE MOTION OF FOLD EDGES
  IS DISREGARDED
* GLOTTAL AREA AND VOLUME VELOCITY CAN BE FAIRLY
  WELL SIMULATED

TWO-MASS MODELS
* REALISTIC SIMULATION OF GLOTTAL PROPERTIES
* PHASE DIFFERENCE BETWEEN THE MOTION OF FOLD EDGES IS
  CONSIDERED
* MUCOSAL SURFACE WAVE IS NOT CONSIDERED IN FLANAGAN
  AND ISHIZAKA'S MODEL BUT IT IS IN KOIZUMI'S MODEL
* NATURAL SPEECH CAN BE PRODUCED
* REASONABLE COMPUTATIONAL BURDEN

SIXTEEN-MASS MODEL
* COMPLEX
* MUCOSAL SURFACE WAVE CAN BE SIMULATED
* HIGH COMPUTATIONAL BURDEN

1.4.1.2 Parametric models

The major approaches for glottal source parameterization
are:

(1) To specify directly a glottal volume velocity
waveform in terms of time parameters ("noninteractive
parametric models").

(2) To specify the glottal area (or other glottal
features) in terms of acoustic parameters and then to obtain
the glottal volume velocity, by using an equivalent circuit
that takes into account the sub-glottal and supra-glottal
loads ("interactive parametric models").

(3)  To  combine  a mechanical  model  with  an  acoustical 
parametric  model  ("hybrid  glottal-source  models,"  Childers  et 
al.,  1986/  Titze,  1989). 

Noninteractive parametric models. Normally, these models
are specified by the fundamental period, amplitude, open
quotient (ratio of pulse duration to pitch period), and speed
quotient (ratio of the rising to falling pulse durations).
The rising and falling phases of the waveform are described by
analytic functions (polynomial, trigonometric, or exponential),
according to the model. Figure 1.18 depicts the glottal volume
velocity and its differentiated waveform for the LF model,
proposed by Fant et al. (1985). Other examples are the
three-parameter model by Rothenberg et al. (1975) and the more
elaborate model by Klatt (1987).
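
As a minimal sketch of how the timing parameters above fix the
shape of one pulse, the following Python fragment converts the
period, open quotient, and speed quotient into rising, falling,
and closed-phase durations, using only the definitions quoted in
the preceding paragraph. The function name `pulse_timing` and the
example values are illustrative assumptions and are not taken
from any of the cited models.

```python
def pulse_timing(period, open_quotient, speed_quotient):
    """Return (rise, fall, closed) durations in the units of `period`.

    open_quotient  = pulse duration / pitch period
    speed_quotient = rising duration / falling duration
    (hypothetical helper, written only to illustrate the definitions)
    """
    if not 0.0 < open_quotient <= 1.0:
        raise ValueError("open quotient must lie in (0, 1]")
    if speed_quotient <= 0.0:
        raise ValueError("speed quotient must be positive")
    pulse = open_quotient * period           # total open (pulse) duration
    fall = pulse / (1.0 + speed_quotient)    # falling part of the pulse
    rise = pulse - fall                      # rising part of the pulse
    closed = period - pulse                  # remainder of the period
    return rise, fall, closed


# Assumed example: 8 ms period (125 Hz), open quotient 0.6, speed quotient 2.0
rise, fall, closed = pulse_timing(8.0, 0.6, 2.0)
```

Any of the waveform families mentioned above could then be fitted
onto these three intervals.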

Interactive parametric models. Rothenberg (1981)
parameterized the glottal conductance. The model of
Ananthapadmanabha and Fant (1982) assumes that the first
formant of the vocal tract is the most influential in the
source-tract interaction. The circuit, shown in Fig. 1.19,
simulates three subsystems:

(1) Glottis: modeled as an impedance Zg (van den Berg et
al., 1957) with a resistive part composed of Rgk(t) (kinetic
losses) and Rgv(t) (viscous loss) and with an inductive part
Lg(t). Both resistive and inductive parts vary inversely with
the glottal area function Ag(t), which, in turn, depends on
factors such as voice intensity, pitch, and phonatory mode.
The pressure inside the glottis decreases linearly with the
distance from the inlet as a result of the viscous resistance
to the air flow. Figure 1.20 shows the pressure distribution
along the glottal flow (Ishizaka and Flanagan, 1972). The
glottal area function is almost insensitive to the variations
in the vocal tract area, because the input impedance of the
vocal tract is much lower than that of the vocal fold system.
On the other hand, the glottal volume velocity is affected by
the variations of the vocal tract during the open glottal
interval, when the glottal impedance decreases (glottal area
increases), becoming comparable to the input impedance of the
vocal tract (Krishnamurthy and Childers, 1986). This
interaction can be analyzed as shown in Fig. 1.21. The
dependency of the volume velocity on the supraglottal
impedance is called "acoustic interaction" and the dependency
of the "glottal vibratory patterns and thus of the glottal
area function on the overall state and aerodynamics" is called
"mechanical interaction" (Fant and Lin, 1987, p. 13). The
result is that the volume-velocity waveform skews to the right
with respect to the glottal area and may display an
oscillatory ripple whose frequency is almost twice that of the
first formant (Ananthapadmanabha and Fant, 1982; Fant, 1982).
Conversely, the source affects the tract. The time-varying
damping impedance represented by the glottis causes a small
(but significant in terms of naturalness of speech) upward
shift in the formant frequencies, an increase in the
bandwidths of the vocal tract, and a decrease in the formant
intensities (Klatt, 1987; Koizumi et al., 1987).

The subglottal subsystem also "loads" the vocal tract,
when the glottal impedance is sufficiently low. Other effects
(for a few vowels) are summarized by Sondhi and Schroeter
(1987) and by Klatt (1987). The kinetic component Rgk(t)
accounts for the losses due to the contraction at the glottis
inlet and to the expansion at the glottis outlet.


VOCAL CORD : Rectangular Parallelepiped
Y : displacement vector of the differential element
x : longitudinal stress of the differential element

Figure 1.17 A continuum model of the vocal folds
(Titze and Strong, 1975).


EQUATIONS FOR FLOW DERIVATIVE:

$E(t) = E_0 \, e^{\alpha t} \sin(\omega_g t), \quad t < TE$

$E(t) = -\dfrac{EE}{\epsilon \, TA} \left[ e^{-\epsilon (t - TE)} - e^{-\epsilon (TC - TE)} \right], \quad TE < t < TC$

where $E_0$, $\alpha$, $\omega_g$ AND $\epsilon$ ARE DETERMINED BY
TP, TE, TA AND EE

TP = GLOTTAL FLOW PEAK
TE = MAXIMUM CLOSING DISCONTINUITY
TA = MAXIMUM CLOSURE
TC = COMPLETE FUNDAMENTAL PERIOD
EE = NEGATIVE PEAK VALUE OF FLOW DERIVATIVE

Figure 1.18 Noninteractive glottal model: LF model
(Fant et al., 1985).


$P_L = \{ 10 \, [ 1 + m_{f0} (F0 - F0_{av}) / F0_{av} ] \}^{Q}$

$Q = 1 + m_{av} (AV - AV_{av}) / AV_{av}$

$R_{gk}(t) = \rho \, (k_1 - k_2) \, |U_g(t)| / A_g^2(t)$

$R_{gv}(t) = 12 \mu \, d_g \, l_g^2 / A_g^3(t)$

$L_g(t) = \rho \, d_g / A_g(t)$

$A_g(t) = A_{g0} + (A_{gm} - A_{g0}) \, A(t)$

where:

$\mu$ : viscosity of air (0.000186 dyn-s/cm2)
$\rho$ : density of air (0.00114 g/cm3)
$k_1$ : entry drop coefficient (1.37 nominal)
$k_2$ : exit recovery coefficient (0.3 nominal)
$d_g$ : vocal-fold thickness (0.05-0.25 cm typical)
$l_g$ : length of the glottal aperture (1.0-1.5 cm typical)
$U_g$ : volume-velocity (0-1000 cc/s typical)
$A_g$ : glottal cross-sectional area (0.0-0.3 cm2 typical)
AV : voicing level (dB); $AV_{av}$ : average AV
F0 : fundamental frequency (Hz); $F0_{av}$ : average F0
$m_{av}$ and $m_{f0}$ : modulation factors

Figure 1.19 Equivalent circuit for dynamic glottal flow
(sub-glottal system, glottis, and supra-glottal system).

Figure 1.20 Pressure distribution along the glottal flow
(Ishizaka and Flanagan, 1972).

Figure 1.21 Circuit for source-tract interaction analysis.
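
To make the inverse dependence on glottal area concrete, the
following Python sketch evaluates the three glottal elements for
a single time instant, following one reading of the expressions
and nominal constants listed in Fig. 1.19. The function name
`glottal_elements` and the default thickness and length (picked
inside the typical ranges quoted in the figure) are assumptions
made for illustration only.

```python
RHO = 0.00114   # density of air, g/cm^3 (value listed in Fig. 1.19)
MU = 0.000186   # viscosity of air, dyn-s/cm^2 (value listed in Fig. 1.19)
K1 = 1.37       # entry drop coefficient, nominal
K2 = 0.3        # exit recovery coefficient, nominal


def glottal_elements(area, flow, thickness=0.15, length=1.2):
    """Return (Rgk, Rgv, Lg) in CGS acoustic units.

    area      : glottal area Ag, cm^2 (must be positive)
    flow      : glottal volume velocity Ug, cm^3/s
    thickness : vocal-fold thickness dg, cm (assumed default)
    length    : glottal aperture length lg, cm (assumed default)
    """
    if area <= 0.0:
        raise ValueError("closed glottis: the impedance is unbounded")
    r_kinetic = RHO * (K1 - K2) * abs(flow) / area ** 2       # kinetic loss
    r_viscous = 12.0 * MU * thickness * length ** 2 / area ** 3  # viscous loss
    inertance = RHO * thickness / area                        # inductive part
    return r_kinetic, r_viscous, inertance


# Assumed operating points: the same flow through a narrow and a wide glottis
narrow = glottal_elements(0.05, 300.0)
wide = glottal_elements(0.20, 300.0)
```

In this sketch, quadrupling the area reduces the kinetic
resistance by a factor of 16, the viscous resistance by 64, and
the inertance by 4, which is why the glottal impedance becomes
comparable to the tract input impedance only during the open
interval.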

(2) Subglottal: modeled as a lung pressure Ps in series
with the impedance represented by the parallel resonant
circuit Rsg Lsg Csg (Ishizaka et al., 1976). Ananthapadmanabha
and Fant (1982) used three cascaded RLC resonant circuits and
verified that only the first resonance was significant. The
lung pressure Ps is related to the voicing level AV
(intensity) and to the fundamental frequency F0. The effects
of the subglottal pressure on the sound pressure and pitch are
described by Ladefoged and McKinney (1963), Isshiki (1964),
and Fant (1982). The empirical formula for Ps, proposed by
Pinto et al. (1989), is given in Fig. 1.19.

(3) Supraglottal: simulates the loading due to the vocal
tract and is modeled as a series combination of two resonant
circuits R1 L1 C1 and R2 L2 C2 (Guerin et al., 1976; Guerin,
1983). Koizumi et al. (1985) investigated the source-tract
interaction when various subglottal and glottal formant loads
were considered and confirmed that the first supraglottal
formant is the major cause of the skewness in the volume
velocity waveform. Some data are provided by the work of
Ananthapadmanabha and Fant (1982).

The glottal area function has been simulated by a few
researchers. The parameters of Fant's model are the pitch
period T, the duration of the glottal open phase To, the
duration of the closing phase Tc and the maximum area Agm.
The parameters of the model by Ananthapadmanabha and Fant
(1982), shown in Fig. 1.22, are the pitch period T, the open
quotient Qo, the speed quotient Qm, the minimum glottal area
Ago and the amplitude of the glottal function Agamp. A
similar model for the glottal area function uses the area
function A(t), proposed by Titze (1982), whose parameters are
the pitch period T, the open quotient Qo, the speed quotient
Qm and the slope factor SL. This model is depicted in Fig.
1.23.

Three-dimensional parametric models. A kinematic
four-parameter model for the three-dimensional glottis was
presented by Titze (1989). The model can provide glottal
flow, glottal area and vocal fold contact area waveforms. The
static glottis is controlled by the abduction quotient Qa, the
shape quotient Qs, and the bulging quotient Qb (Fig. 1.24).
The phase quotient Qp and the fundamental frequency F0 control
the dynamic glottis.

1.4.1.3 Data for subglottal models

The control parameter is the subglottal pressure Ps,
which is related to the energy of speech and thus to the voice
intensity. Two noninvasive techniques are most commonly used
to obtain Ps: transglottal pressure transducer (Koike and
Perkins, 1968) and intraesophageal pressure (van den Berg,
1956). The former technique consists of the direct
measurement of the output pressure by a transducer carefully
placed in the subglottal cavity. The latter technique
determines Ps indirectly, by relating it to the
intraesophageal pressure, measured by an air-balloon (with a
tube) introduced into the esophagus. Some empirical data are
available from Ladefoged and McKinney (1963), Isshiki (1964)
and Fant (1982).

1.4.1.4 Data for glottal models

T : PERIOD
To : OPENING PHASE
Tc : CLOSING PHASE
To + Tc : OPEN PHASE

Qo : OPEN QUOTIENT = (To + Tc) / T
Qm : SPEED QUOTIENT = To / (To + Tc)

Figure 1.22 Glottal area model of Ananthapadmanabha
and Fant (1982).


Normalized Titze Model Glottal Area Function

Curve parameters as listed in the figure:
A : 30, 2.0, 1.0
B : 50, 1.5, 1.5
C : 70, 5.0, 0.3
D : 90, 1.0, 2.0

$A(t) = \left[ (\theta / \theta_m)^{-\theta_m \cot \theta_m} \, (\sin \theta / \sin \theta_m) \right]^{\beta}, \quad 0 < \theta < \pi$

$A(t) = 0, \quad \theta \geq \pi$

where:

$\theta = \pi t / (\gamma T)$, $\theta_m = \pi \delta / (1 + \delta)$

T : pitch period
$\gamma$ (Qo) : open quotient (0.1-0.9 typical): duration of the
glottal open phase to the duration of the complete
glottal pulse
$\delta$ (Qs) : speed quotient (0.5-5.0 typical): duration of the
glottal opening phase to the duration of the
glottal closing phase (Titze's definition)
$\beta$ (SL) : slope factor (0.7-3.0 typical): time constant of
the residue decay (percentage of pitch period)

Figure 1.23 Titze's area function model
(Pinto et al., 1989).


Qa : abduction quotient
Qs : shape quotient
Qb : bulging quotient
A : unit amplitude of vibration

Figure 1.24 Three-dimensional glottal model: prephonatory
configuration (Titze, 1989).


The parameters for the excitation are the pitch period,
the adduction (voicing onset) and abduction (voicing decay)
of the vocal folds, and the area function. Several devices
and algorithms are available for extracting the pitch from the
original speech, for example, the well-established
"modified-autocorrelation" and "cepstrum" methods (Rabiner and
Schafer, 1978; Hess, 1982), peak-picking (Howard, 1989),
zero-crossing rate, etc. A convenient technique for pitch
detection is electroglottography (EGG), discussed later in
this section. The glottal area function Ag(t) is the most
important control parameter of the source model insofar as all
the components of the models depend on it. Moreover, since
naturalness of the synthesized speech is closely related to
the source model (Fant, 1981; Rothenberg, 1981; Childers et
al., 1983), the importance of its reliable derivation
increases. A complete review of the techniques to derive
glottal data is provided by Childers (1977). The techniques
that have been more frequently used to observe the vibratory
patterns of the vocal folds and to derive glottal data are:

Stroboscopy. The original technique used photography,
providing only the gross structure and movements of the vocal
folds (Lecluse, 1975). It has advanced in this decade by
adopting video recording techniques, which enable one to
obtain a slow-motion view of the folds.

High-speed laryngeal cinematography. A mirror is placed
in the pharynx to reflect the high-intensity incident light
beam onto the vocal folds and to reflect the image of the
vocal folds back to a high-speed motion picture camera
(Childers et al., 1980). The estimated area does not
correspond to that produced in natural conditions since during
the process the acoustical system is "perturbed" by the
unpleasant mirror and by the always-open mouth. The procedure
is invasive, expensive and limited.

Videofiberoptic nasopharyngoscopy. A small tube is
passed through the nose, allowing the video recording of the
vocal folds (National Strategic Research Plan, 1990).

Photoglottography (PGG). A noninvasive technique. A
light is emitted by an external source placed on the neck wall
and after passing through the vocal folds is detected by a
sensor, a multiplier photo-tube placed in the larynx, capable
of exciting an oscilloscope. The displayed curve is then
related to the vocal fold movement (Kitzing and Sonesson,
1974).

Ultrasonoglottography (UGG). It is based on the
relationship between ultrasonic energy and lateral contact
area of the vocal folds.

Electroglottography (EGG). It is an efficient indirect
method to derive the glottal area function (Childers and
Krishnamurthy, 1985) and the pitch. The electrical impedance
between two plate electrodes, held in contact with the skin on
both sides of the larynx, varies periodically as the vocal
folds vibrate. The glottal area is derived from this variable
impedance. Figure 1.25 depicts the system configuration and
the EGG waveform (Childers and Larar, 1984). The DEGG, short
for differentiated EGG, provides a convenient way to determine
the time events required for deriving the glottal area
function: the pitch contour and the closed and open glottal
interval (Childers et al., 1985). It has been inferred from
many experiments that the beginning of the open-phase
corresponds to the maximum value of the DEGG and that the
closing instant corresponds to the minimum of the DEGG. These
results were validated by Childers et al. (1984), using EGG
and speech waveforms synchronized to ultra-high-speed
laryngeal films and by Anastaplo and Karnell (1988), employing
a simultaneous EGG/videostroboscopic technique, always for
nonpathological voices. Nevertheless, two issues still need
more investigation: the calibration of the transducer (to
establish a reliable relationship between the EGG and the
vocal fold contact area) and the interpretation of the
relationship between contact area and vocal fold movement
(Titze, 1989). Figure 1.26 shows a DEGG waveform for a voiced
segment of speech (Pinto et al., 1989).


Figure 1.25 System configuration for EGG and waveform
(Childers and Larar, 1984). Traces: glottal area, DEGG, EGG.

AB: closed phase  BC: opening phase  CD: closing phase
Figure 1.26 Glottal area from EGG (Pinto et al., 1989).
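
The DEGG rule quoted above (maximum of the DEGG at the beginning
of the open phase, minimum at the closing instant) can be sketched
in a few lines of Python. This is only an illustration under
stated assumptions: the helper names are hypothetical, the EGG
period is synthetic rather than measured, the signal polarity is
taken to match the convention of the text, and a real system would
first need pitch-synchronous segmentation and smoothing.

```python
def differentiate(signal, fs):
    """First-difference approximation of the time derivative (DEGG)."""
    return [(signal[n + 1] - signal[n]) * fs for n in range(len(signal) - 1)]


def glottal_events(egg_period, fs):
    """Return (opening_time, closing_time) in seconds for one EGG period.

    Convention assumed from the text: the DEGG maximum marks the start
    of the open phase and the DEGG minimum marks the closing instant.
    """
    degg = differentiate(egg_period, fs)
    i_open = max(range(len(degg)), key=lambda n: degg[n])
    i_close = min(range(len(degg)), key=lambda n: degg[n])
    return i_open / fs, i_close / fs


# Synthetic one-period EGG in arbitrary units (not measured data):
# a jump up after sample 10 and a larger jump down after sample 30.
fs = 10000.0
egg = [0.0] * 11 + [1.0] * 20 + [-1.0] * 10
t_open, t_close = glottal_events(egg, fs)
```

With these two instants per period, the open and closed glottal
intervals follow by subtraction, and the period-to-period spacing
of either event gives the pitch contour.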

Inverse Filtering. This method consists of the
analytical reconstruction of the excitation waveform from the
speech signal. A linear prediction analysis during the closed
glottal interval is the standard procedure (Wong et al.,
1979). The minimum squared error between the reconstructed
waveform and the given model waveform is the usual criterion
of optimization (Yea et al., 1983). The detection of the
glottal closed phase can be properly made by using the EGG
signal or an adaptive filtering technique (Ting and Childers,
1988).

1.4.2  Vocal  and  Nasal  Tract  Models 

The vocal tract is an acoustic tube with special
characteristics. It is non-uniform and slowly time-varying in
cross-sectional area and shape; it is lossy, having yielding
walls, viscous friction and heat conduction; it has
fluctuating boundaries at the lips and at the glottis; and it
becomes two branches when the velum is lowered, which
introduces the nasal tract. A complete model does not exist
(Rabiner and Schafer, 1978) and at the present
state-of-the-art would not be worthwhile because of the
required computational burden. Therefore, it is imperative
that simplifications be made.

The first general simplification consists of assuming
plane wave propagation (Portnoff, 1973; Klatt, 1980). This
assumption reduces the wave propagation from three
dimensions to one. The error introduced by this
simplification is negligible for frequencies under 5 kHz,
owing to the dimensions of the tract. Application of Newton's
momentum law and the continuity of mass law to a nonuniform
and lossless tube results in the equations of Portnoff (1973).
Then, the acoustic tube is studied as non-bent (Sondhi, 1986),
uniform and lossless. The analogies between this acoustic
tube and electrical transmission lines are established. After
that, the effects of the yielding wall, heat conduction,
viscous friction and radiation losses are added (Flanagan,
1972a). Table 1.3 summarizes the acoustic tube equations
(Rabiner and Schafer, 1978). Next, several lossy uniform
tubes with equal length and different cross-sectional areas
are concatenated to simulate the real nonuniform and lossy
vocal tract. These cross-sectional areas provide a stepwise
approximation of the vocal tract area function. A
discrete-time model can be derived, if the vocal tract losses
are neglected. The nasal tract is also modeled as a
non-uniform acoustic tube, with an internal shunt provided by
two sinus cavities (Maeda, 1982a; Fant et al., 1985). The two
frontal channels that end in the nostrils are modeled as a
single channel with negligible error. The nasal tract itself
represents a side branch of the vocal tract (Hecker, 1962).
Figure 1.27 sketches the oral and nasal tracts and their area
functions. Some data concerning the sinus cavities and the
nasal tract were generated by Fant (1960), Fujimura (1960,
1961, 1962), Lindqvist and Sundberg (1972), Maeda (1982), and
Fant et al. (1985).

1.4.2.1 Uniform acoustic tube

The equations for a uniform tube and the modifications
caused by the losses are given in Table 1.3. The formants can
be mathematically defined as the poles of the vocal tract
frequency response (the relationship between the volume
velocity at the lips and volume velocity supplied by the
source). Each complex conjugate pair of poles represents a
peak in the spectrum (formant). The electrical equivalent of
the lossless uniform acoustic tube, for plane-wave
propagation, is a uniform electrical transmission line, with
shunt capacitance per unit length C and series inductance per
unit length L. The "characteristic acoustic impedance" is, by
analogy, defined as the positive value of the square root of
the ratio of L to C. When the viscous friction and the heat
conduction are considered, the electrical equivalent circuit
for an "elemental length" is that shown in Fig. 1.28
(Flanagan et al., 1970). Each "pipe" element is represented
by a T-section whose impedances Za and Zb are hyperbolic
functions of the "complex acoustic propagation constant
$\gamma$" and of the length l of the pipe. The elements L
(inertance caused by the mass of air inside the pipe), C
(compliance due to the compressibility of air in the pipe), R
(viscous friction) and G (heat conduction) are derived from
the expansion of the impedances Za and Zb. The addition of
the effects of the wall vibrations (yielding walls) results
in the equivalent circuit of Fig. 1.29. Table 1.4 gives
the physical definitions for this equivalent circuit.


TABLE 1.3 ACOUSTIC TUBE EQUATIONS

1. NONUNIFORM LOSSLESS AND TIME-VARYING CROSS SECTION
   (Portnoff Equations)

$-\partial p / \partial x = \rho \, \partial [u / A(x,t)] / \partial t$

$-\partial u / \partial x = (1 / \rho c^2) \, \partial [p A(x,t)] / \partial t + \partial A(x,t) / \partial t$

where: p(x,t): sound pressure at position x, time t.
       u(x,t): volume-velocity at position x, time t.

2. UNIFORM LOSSLESS

2.1 Time Domain

$-\partial p / \partial x = (\rho / A) \, \partial u / \partial t$

$-\partial u / \partial x = (A / \rho c^2) \, \partial p / \partial t$

2.2 Traveling Waves Solution

$u(x,t) = u^{+}(t - x/c) - u^{-}(t + x/c)$

$p(x,t) = (\rho c / A) \, [u^{+}(t - x/c) + u^{-}(t + x/c)]$

2.3 Freq. Domain

$-dU/dx = Y P$

$-dP/dx = Z U$

where: $Z = j\omega\rho / A$, $Y = j\omega A / \rho c^2$

2.4 Sin. Steady State Solution

$u(x,t) = \cos[\omega (l - x)/c] \, V_A \, U_G(\omega) \, e^{j\omega t}$

$p(x,t) = j Z_0 \sin[\omega (l - x)/c] \, V_A \, U_G(\omega) \, e^{j\omega t}$

$Z_0 = \sqrt{Z/Y} = \rho c / A$, $V_A = 1 / \cos(\omega l / c)$

Boundary conditions: $p(l,t) = 0$, $u(0,t) = U_G(\omega) e^{j\omega t}$

3. UNIFORM TUBE WITH YIELDING WALL AND NO OTHER LOSSES

3.1 Time Domain

$-\partial p / \partial x = \rho \, \partial (u / A_0) / \partial t$

$-\partial u / \partial x = (1 / \rho c^2) \, \partial (p A_0) / \partial t + \partial (\delta A) / \partial t + \partial A_0 / \partial t$

where: $A(x,t) = A_0(x,t) + \delta A(x,t)$, with $A_0$ the nominal area

3.2 Freq. Domain

$-dP/dx = Z' U$

$-dU/dx = Y' P$

where: $Z'(x,\omega) = j\omega\rho / A_0(x)$

$Y'(x,\omega) = 1 / [j\omega m(x) + b(x) + k(x)/j\omega] + j\omega A_0(x) / \rho c^2$

4. EFFECT OF VISCOUS FRICTION

$Z(x,\omega) = \{ S(x) / [A_0(x)]^2 \} \sqrt{\omega\rho\mu / 2} + j\omega\rho / A_0(x)$

The first (resistive) term is added to Z in Eq. 2.3 or to Z'
in Eq. 3.2.

5. EFFECT OF HEAT CONDUCTION

$Y(x,\omega) = [ S(x) (\eta - 1) / \rho c^2 ] \sqrt{\lambda\omega / 2\xi\rho} + j\omega A_0(x) / \rho c^2$

The first (conductive) term is added to Y in Eq. 2.3 or to Y'
in Eq. 3.2.

Note: Symbols are given in Table 1.4.
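
As a worked illustration of the statement that the formants are
the poles of the tract response: for the lossless uniform tube of
Table 1.3, the factor 1/cos(wl/c) is unbounded where
wl/c = (2n - 1)pi/2, so the formants fall at F_n = (2n - 1)c/(4l).
The Python sketch below evaluates these frequencies and the
characteristic impedance Z0 = rho*c/A. The function names and the
17.65 cm example length are assumptions chosen only to give round
numbers; the constants are those of Table 1.4.

```python
C_SOUND = 35300.0   # sound velocity, cm/s (Table 1.4)
RHO = 0.00114       # density of air, g/cm^3 (Table 1.4)


def uniform_tube_formants(length_cm, count=3, c=C_SOUND):
    """Formants (Hz) of a lossless uniform tube, closed at the glottis
    and ideally open (p = 0) at the lips: F_n = (2n - 1) c / (4 l)."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


def characteristic_impedance(area_cm2, rho=RHO, c=C_SOUND):
    """Characteristic acoustic impedance Z0 = rho * c / A (CGS units)."""
    return rho * c / area_cm2


# Assumed example: a 17.65 cm tube gives formants near 500, 1500, 2500 Hz
formants = uniform_tube_formants(17.65)
z0 = characteristic_impedance(5.0)
```

The losses discussed next shift these idealized values and give
each resonance a finite bandwidth.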

The effects of the losses are the following (Rabiner and
Schafer, 1978):

Yielding walls slightly raise the frequencies and
bandwidths of the formants (Sondhi, 1974). The losses are
represented in the equivalent circuit (Fig. 1.29) as the wall
impedance Zw (Rw, Lw, Cw), in parallel with C. Low frequencies
are more affected. Ishizaka et al. (1975) and Fant et al.
(1976) also provided some data.

Viscous-friction and heat-conduction losses cause a
small decrease in the formant frequencies and a small
broadening in their bandwidths. In the equivalent circuit of
Fig. 1.29, the viscous-friction loss is represented by the
frequency-dependent resistor R, in series with L; the
heat-conduction loss, by the frequency-dependent conductance
G, in parallel with C. High frequencies are more affected by
both kinds of losses. The heat conduction and viscous friction
in the nasal tract are relatively larger than those for the
vocal tract, causing a greater broadening in the bandwidths
for nasals.


Figure 1.27 Interconnection of vocal and nasal cavities.


$Z_a = Z_0 \tanh(\gamma l / 2)$

$Z_b = Z_0 \, \mathrm{csch}(\gamma l)$

$Z_0 = \rho c / A$

$\gamma = \alpha + j\beta$

Figure 1.28 Equivalent circuit of a tube with losses due
to viscous friction and heat conduction
(Flanagan et al., 1970).


TABLE 1.4 CIRCUIT COMPONENTS FOR A LOSSY ELEMENTAL TUBE
AND ASSOCIATED SYMBOLS

R  : Series Resistance
L  : Series Inductance
C  : Shunt Capacitance
G  : Shunt Conductance
Rw : Resistance in Wall Impedance
Lw : Inductance in Wall Impedance
Cw : Capacitance in Wall Impedance

where

l : length of element
$\rho$ : density of air (1.14 x 10^-3 g cm^-3)
c : sound velocity (35300 cm/sec)
$\mu$ : viscosity (1.86 x 10^-4 dyne sec cm^-2)
$\eta$ : adiabatic gas constant (1.4)
$\lambda$ : coefficient of heat conduction of air
    (5.5 x 10^-5 cal cm^-1 sec^-1 deg^-1)
$\xi$ : specific heat (2.4 x 10^-1 cal g^-1 deg^-1)
A : cross-sectional area of element
S : circumference of element
$\omega$ : radian frequency
b : mechanical resistance of wall per unit length
m : mass of wall per unit length
k : stiffness of wall per unit length

Source: Ding (1990).


Figure 1.29 Equivalent circuit for an elemental length
vocal tract tube with yielding-wall, viscous-friction, and
heat-conduction losses. Symbols: Table 1.4.

Figure 1.30 Simplified lip radiation model.
Symbols: Table 1.4.

The combined effect of these three types of losses, as
compared with the lossless case, is a rise in the bandwidths
of all formants, a slight increase in the frequencies of the
first, second and third formants, and a slight decrease in the
frequencies of the fourth and fifth ones. Fant (1985)
provides a useful equation to correct the values of the
formant frequencies after their estimation without accounting
for the losses and also a formula to estimate the average
bandwidth at each frequency of the vocal tract transfer
function.

For  the  same  vocal  tract  shape,  however,  the  radiation  at 
the  lips  is  the  most  important  factor  affecting  the  formants, 
broadening  their  bandwidths  and  lowering  their  center 
frequencies.  The  radiation  losses  are  more  pronounced  at 
higher  frequencies. 

The net effect, considering all the losses, can be
summarized as follows:

(1) The formant bandwidths are broadened owing,
primarily, to the yielding walls (first formant), the
radiation losses (second to fifth formants), and the viscous
friction and thermal losses (second and third).

(2) The frequencies of the formants, mainly those of the
fifth, fourth, and third, experience a downward shift.

1.4.2.2 Radiation at lips and nostrils

A model for simulating the lip termination can be based
on the radiation from a spherical baffle, where the lip
opening is the radiating surface and the head is the baffle
diffractor (Morse and Ingard, 1968). Wakita and Fant (1978)
proposed a series connection of a resistance and an inductance
to simulate the lip impedance. A simplification leads to the
convenient model of an infinite plane baffle (at low
frequencies and with the lip opening much smaller than the
head). The equivalent circuit, shown in Fig. 1.30, is the
high-pass filter represented by the parallel connection of a
radiation conductance Gr and a radiation susceptance Sr
(Flanagan, 1972a). A similar model is used for the radiation
impedance at the nose, with the only difference being in the
effective radius of the orifices. For the mouth, the radius
depends on the area at the lips, while for the nose it is
assumed constant (Rubin et al., 1981).
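
The high-pass character of the infinite-plane-baffle load can be
checked numerically. The Python sketch below uses the parallel
resistance-inductance approximation commonly quoted for a piston
in an infinite baffle, with values normalized to rho*c/A:
Rr = 128/(9 pi^2) and Lr = 8a/(3 pi c), where a is the effective
radius of the opening. These element values are not listed in
this section and are brought in here as an assumption, as are the
function name and the circular-opening estimate of the radius.

```python
import math

C_SOUND = 35300.0  # sound velocity, cm/s (Table 1.4)


def radiation_impedance(freq_hz, opening_area_cm2, c=C_SOUND):
    """Radiation impedance normalized to rho*c/A for a parallel R-L load.

    Assumed element values (piston in an infinite plane baffle):
        Rr = 128 / (9 pi^2),  Lr = 8 a / (3 pi c)
    with a the radius of a circle having the same area as the opening.
    """
    a = math.sqrt(opening_area_cm2 / math.pi)
    r_rad = 128.0 / (9.0 * math.pi ** 2)
    l_rad = 8.0 * a / (3.0 * math.pi * c)
    jwl = 1j * 2.0 * math.pi * freq_hz * l_rad
    return jwl * r_rad / (r_rad + jwl)   # R in parallel with jwL


# Assumed lip area of 1 cm^2: the load grows with frequency
low = abs(radiation_impedance(200.0, 1.0))
high = abs(radiation_impedance(4000.0, 1.0))
```

For the nostrils the same function would be called with a fixed
area, consistent with the constant effective radius mentioned
above.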

At normal atmospheric pressures, the radiation through
the walls of the throat is usually negligible, except for
voiced stops. Nevertheless, in hyperbaric atmosphere (diver's
speech), this radiation turns out to be relatively large
because the vocal tract walls become less rigid (Hisayoshi et
al., 1986). For voiced stops, the radiation through the walls,
before the release of the pressure, can be simulated by an
impedance in each elemental length of the vocal tract
(Flanagan and Ishizaka, 1975).

1.4.2.3 Hearing

The final target of the synthesizer is the human ear;
hence, the sound pressure at a given distance must be
determined. A differentiator model can satisfactorily
estimate the sound pressure from the volume velocity, assuming
uniform radiation in all directions. For nasals, the
resultant volume velocity is the sum of that at the lips and
that at the nostrils.
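
A minimal sketch of such a differentiator model, assuming a
simple (monopole) source radiating uniformly, for which the
pressure at distance r is proportional to the time derivative of
the volume velocity: p = rho/(4 pi r) dU/dt. The propagation
delay r/c is omitted, the derivative is a first difference, and
the function names are hypothetical.

```python
import math

RHO = 0.00114  # density of air, g/cm^3 (Table 1.4)


def total_flow(lip_flow, nostril_flow):
    """Sum lip and nostril volume velocities sample by sample (nasals)."""
    return [ul + un for ul, un in zip(lip_flow, nostril_flow)]


def radiated_pressure(flow, fs, distance_cm, rho=RHO):
    """Estimate sound pressure (dyn/cm^2) from volume velocity (cm^3/s).

    Assumes a simple source: p = rho / (4 pi r) * dU/dt, with dU/dt
    approximated by a first difference at sampling rate fs (Hz).
    """
    gain = rho / (4.0 * math.pi * distance_cm)
    return [gain * (flow[n] - flow[n - 1]) * fs for n in range(1, len(flow))]


# Assumed example: a linearly rising flow gives a constant pressure
fs = 10000.0
ramp = [1000.0 * n / fs for n in range(50)]
pressure = radiated_pressure(ramp, fs, 30.0)
```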

1.4.2.4 Noise source

The  mechanism  of  turbulence  can  be  modeled  to  allow  the 
synthesis  of  fricatives  and  plosives.  As  presented  in  Section 
1.3.2,  when  the  air-flow  passes  with  high-enough  velocity 
through  a constriction  (exceeding  a threshold  Reynolds 
number),  turbulent  noise  is  created.  The  cavity  that  is 
formed  in  front  of  the  constriction  requires  a more  accurate 
description than that behind the constriction (Fant, 1960).
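
The turbulence criterion can be sketched as follows. The Python
fragment computes a Reynolds number Re = rho*v*d/mu for the flow
through a constriction, taking v = U/A and, as an assumption, d
as the diameter of a circle of equal area. The critical value is
left as a caller-supplied parameter because no threshold is given
in this section; the value 1800 in the example is purely
illustrative.

```python
import math

RHO = 0.00114   # density of air, g/cm^3 (Table 1.4)
MU = 0.000186   # viscosity of air, dyne-sec/cm^2 (Table 1.4)


def reynolds_number(flow, area, rho=RHO, mu=MU):
    """Reynolds number for volume velocity `flow` (cm^3/s) through a
    constriction of cross-sectional area `area` (cm^2).

    Assumption: the characteristic dimension is the diameter of a
    circle with the same area as the constriction.
    """
    velocity = abs(flow) / area                 # particle velocity, cm/s
    diameter = math.sqrt(4.0 * area / math.pi)  # equivalent diameter, cm
    return rho * velocity * diameter / mu


def is_turbulent(flow, area, re_critical):
    """True when the flow exceeds the caller-supplied critical Re."""
    return reynolds_number(flow, area) > re_critical


# Assumed example: 200 cm^3/s through a 0.1 cm^2 constriction
re = reynolds_number(200.0, 0.1)
noisy = is_turbulent(200.0, 0.1, re_critical=1800.0)  # illustrative threshold
```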

A model due to Flanagan et al. (1975) represents the noise
source in each element of the transmission line as a sound
pressure Pn in series with a resistance Rn (inherent
constriction loss). Sondhi and Schroeter (1987) proposed a
parallel configuration for this model, to improve the
generation of unvoiced sounds. They simply converted the
"voltage source" into a "current source" and neglected the
shunt resistance Rn. For frication, this "short-circuit
current source" was placed at the section anterior to the
outlet of the narrowest constriction. For aspiration, the
noise source was introduced at the glottis. A representation
of this model and the expressions of the components are shown
in Fig. 1.31.
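The threshold test on the Reynolds number can be illustrated with a short sketch. It is only a hedged example: the definition Re = rho v d / mu with v = U/A and d taken as the diameter of a circle of area A, the constants, and the function names are assumptions, and the critical value is left as a parameter because the text does not state it.

```python
import math

RHO = 1.14e-3  # assumed air density, g/cm^3
MU = 1.86e-4   # assumed air viscosity, g/(cm s)


def reynolds_number(flow, area, rho=RHO, mu=MU):
    """Reynolds number for volume velocity `flow` (cm^3/s) through a
    constriction of cross-sectional area `area` (cm^2)."""
    diameter = math.sqrt(4.0 * area / math.pi)  # equivalent circular duct
    velocity = abs(flow) / area                 # mean particle velocity
    return rho * velocity * diameter / mu


def noise_is_active(flow, area, re_critical):
    """True when the flow exceeds the (caller-supplied) threshold."""
    return reynolds_number(flow, area) > re_critical
```

With these definitions Re grows linearly with the flow and with the inverse square root of the area, so a narrow constriction reaches the threshold at a lower flow.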

The  best  location  of  the  noise  source  needs  more 
investigation.  Shadle  (1985)  reported  that  the  placement  of 
the  noise  source  may  depend  on  the  particular  articulation. 
The alternatives are: at the center of the constriction,
downstream or upstream of the constriction, spread along a
certain interval or, finally, a combination of locations. For
voiced fricatives
the  voiced  excitation  modulates  the  noise  source.  For 
aspirated  sounds  the  noise  source  is  placed  at  the  glottis. 
Figure  1.32  shows  the  block  diagram  for  the  synthesis  of 
fricatives  and  Fig.  1.33  shows  the  diagram  for  voiced  and 
aspirated  sounds  (Rubin  et  al.,  1981). 

1.4.2.5 Concatenation of tubes

The  vocal  tract  is  modeled  by  a concatenation  of  uniform 
tubes  (elemental  lengths)  whose  cross-sectional  areas 
approximate  the  area  function  of  the  vocal  tract.  The  length 
of  each  element  must  be  much  smaller  than  the  speech 
wavelengths;  hence,  a minimum  of  20  areas  are  necessary  (Atal 
et al., 1977).

Pn : sound pressure source, proportional to Re
Rn : source inherent resistance
Z1 : glottal input impedance (seen at constriction)
Z2 : lip input impedance (seen at constriction)
Re : Reynolds number
Un : short-circuit noise flow

Figure 1.31 Noise source models (Flanagan et al., 1975).

Two major approaches can be used:

Electronic network or lumped transmission-line. The
first approach represents each concatenated tube as the
element of the analog transmission-line shown in Fig. 1.29.
The inductances, capacitances and resistances in the model are
dependent on the length of the element and on the
cross-sectional area of the element, as shown in Table 1.4. They
represent the inertance, the compliance and the losses
associated with each element. A finite number of concatenated
elements of the transmission-line, a glottal excitation
source for voiced sounds, a noise source for fricatives and
plosives near the point of maximum constriction (or at the
glottis to produce aspirated sounds) and a radiation load
compose the entire model. A block diagram is shown in Fig.
1.34 (Maeda, 1982a) and a network representation is given in
Fig. 1.35 (Flanagan et al., 1970). The last step is to apply
Kirchhoff's laws and solve the set of differential equations
that describe the network, by using appropriate tools, namely,
stable numerical methods (implicit methods such as the backward
Euler and trapezoidal algorithms) and computer-aided
analysis of electronic circuits. The transfer function of the
vocal tract can be numerically calculated (Flanagan, 1972a;
Atal et al., 1979), so the formant frequencies and
bandwidths can be obtained. The coupling of the vocal and
nasal tract during the production of nasals is represented by
the parallel association of the two sections in Fig. 1.35.
Figure 1.32 Block diagram for synthesis of fricatives
(Rubin et al., 1981).

Figure 1.33 Block diagram for synthesis of voiced and
aspirated sounds (Rubin et al., 1981).
(a) first version.
(b) equivalent circuit, with Norton equivalent of the leftmost
block of (a).

Neither the values of the electrical equivalent components nor
the optimum number of elements has been firmly established
(Wakita and Fant, 1978).
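As an illustration of how the reactive elements depend on the length and area of each tube, the sketch below uses the usual lossless expressions, inertance L = rho l / A and compliance C = A l / (rho c^2). These formulas and constants are assumptions made for the example (Table 1.4 is not reproduced in this section, and the loss resistances are omitted); the function name is hypothetical.

```python
RHO = 1.14e-3    # assumed air density, g/cm^3
C_SOUND = 3.5e4  # assumed sound velocity, cm/s


def lossless_elements(areas, length, rho=RHO, c=C_SOUND):
    """Return one (inertance, compliance) pair per tube section.

    areas  : cross-sectional areas of the sections (cm^2)
    length : common length of the sections (cm)
    """
    elements = []
    for a in areas:
        inertance = rho * length / a               # acoustic "inductance"
        compliance = a * length / (rho * c ** 2)   # acoustic "capacitance"
        elements.append((inertance, compliance))
    return elements
```

Two quick consistency checks follow from these expressions: sqrt(L/C) equals the characteristic impedance rho c / A of the section, and sqrt(L C) equals its propagation delay l / c.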

Reflection coefficients and discrete-time. The second
approach is to neglect the losses caused by viscous friction,
heat-conduction and yielding walls and to use the
traveling-wave equations (Kelly and Lochbaum, 1962) in order
to derive simpler analog models and a direct digital model.
The application of the boundary conditions at the junction of
two tubes (Rabiner and Schafer, 1978) leads to the convenient
representation provided by the reflection coefficients r_i and
by the propagation delays τ. Reflection coefficients are also
established for the terminations at the lips, r_l, and at the
glottal end, r_g (Rubin et al., 1981). An equivalent
discrete-time system can be derived, by considering:

(1) Equal-length concatenated lossless tubes.

(2) Each propagation delay τ is equivalent to a half-sample
delay, that is, T = 2τ, where T is the sampling
period.

(3) A proper combination of the propagation delays in
the upper and lower branches of the ladder model so as to
express each section by the desired whole sample delay z^-1.

(4) The validity of the model only for the band of
frequencies under F = 1/(2T), where T is the sampling period.
Figure 1.34 Block diagram for a lumped transmission-line
synthesizer model (Maeda, 1982a).

Figure 1.35 Network for the lumped transmission-line
synthesizer model (Flanagan et al., 1970).

(5) The losses due to the glottis and lips are
considered by the proper choice of the reflection coefficients
at the two terminations.
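The quantities used in the conditions above can be sketched numerically. The example assumes the convention r_i = (A_{i+1} - A_i) / (A_{i+1} + A_i), with sections numbered from the glottis to the lips (other texts use the opposite sign), and an assumed sound velocity; the function names are illustrative only.

```python
C_SOUND = 3.5e4  # assumed sound velocity, cm/s


def reflection_coefficients(areas):
    """Junction reflection coefficients for adjacent lossless tubes."""
    return [(a2 - a1) / (a2 + a1) for a1, a2 in zip(areas, areas[1:])]


def sampling_rate(tract_length, n_sections, c=C_SOUND):
    """Sampling frequency implied by T = 2 * tau for equal-length tubes."""
    tau = (tract_length / n_sections) / c  # one-way delay per section (s)
    return 1.0 / (2.0 * tau)
```

For example, a 17.5 cm tract divided into 20 sections gives a sampling frequency of about 20 kHz, so the model would be valid up to about 10 kHz according to condition (4).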

These models possess the advantages brought about by
digital filters: greater speed, easier implementation of
resonators (no "higher pole correction") and even the
accomplishment of real-time synthesis (Meyer, 1989).
On the other hand, neglecting the losses and the
source-tract interaction may represent a significant departure
from the ultimate goal of high-quality and natural-sounding
synthetic speech. Some research has been done to include the
effects of losses (Kabasawa et al., 1983), but more
investigation is still required.

1.4.2.6 Data for vocal tract models

The main feature of the vocal tract models is the
cross-sectional area function, which depends on the position of the
articulators. The methods for deriving the cross-sectional
area can be grouped in two major classes: "direct" and
"indirect" methods. The "direct" methods determine the
cross-sectional area by measuring the vocal tract with special
devices or techniques. High-speed cineradiography
(Perkell, 1969) and the X-ray microbeam technique (Kiritani,
1986), for example, measure the sagittal distances of the
vocal tract, from which the cross-sectional area function can
be estimated. Techniques for deriving the area function from
the sagittal distances are described by Fant (1960) and
Johansson et al. (1983). The "indirect" methods try to
extract the area function from the original speech using an
analytical process; for example, inverse transform of the
speech signal (acoustic-to-articulatory transformation),
acoustical pulse reflections, feedback or adaptive synthesis,
etc. The knowledge accumulated by measurements and analyses
constitutes the set of articulatory codebooks and rules for
articulatory synthesis.

X-ray photography or high-speed cineradiography. This
method has been utilized to obtain the area function (Perkell,
1969), but was abandoned because of the known problems of
accumulated exposure to radiation. Nevertheless, much of
the research in articulatory synthesizers still depends on the
sparse data extracted from X-ray films.

X-ray  microbeam.  To  minimize  the  radiation  dosage,  an  X- 
ray  microbeam  generator  tracks  the  position  of  metal  pellets 
attached  to  the  articulators  of  a subject,  who,  of  course, 
cannot  have  any  dental  metallic  fillings,  caps  or  bridges 
(Kiritani, 1986).

Computed tomography (CT). It is based on radiographic
imaging techniques. Results are provided by Johansson et al.
(1983).

Magnetic resonance imaging (MRI). The latest technique
to measure the vocal tract sagittal distances.

Acoustic-to-articulatory transformation. The first
contribution to the evaluation of the vocal tract area
function from the speech waveform was by Schroeder and
Mermelstein (1965), who used a first-order perturbation
analysis technique relating speech features and articulatory
parameters. The mapping from the analyzed speech to the
cross-sectional area is not unique and depends on the boundary
conditions (Charpentier, 1984; Atal, 1978). A natural way to
reduce the ambiguity is to impose constraints (derived from
the physiology of the vocal tract) on the expected area
functions. This inverse mapping can be unique for specific
loss distributions in the vocal tract (Atal et al., 1978). To
illustrate the inverse mapping problem, let x be the vector
that describes the configuration of the articulators (using
areas or other parameters), y the vector that describes the
acoustic signal (using LPC coefficients or other parameters)
and f a function such that y = f(x). The inverse transformation
x = g(y) is multivalued, that is, given one y, several vectors
x can be found. For a lossless acoustic tube with ideal
terminations and length L, "the functions A(d) and 1/A(L-d)
give the same transfer function" (Sondhi, 1979, p. 269). This
means that, by reversing the position of the glottis and the
lips and by inverting the area values, a pair of area
functions is found that have the same transfer function
(Charpentier, 1984). Besides, for some y, x may be
physiologically impossible or may not exist. The regions in
the x space that produce the same y's are called fibers.
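The ambiguity quoted from Sondhi (1979) can be checked numerically for a stepwise area function. The sketch below is only an illustration under stated assumptions: equal-length lossless sections, a volume-velocity source at the glottis and zero pressure at the lips, so that the transfer function U_lips/U_glottis equals 1/D, where D is the lower-right entry of the product of the section chain matrices. The names and constants are hypothetical.

```python
import math

RHO = 1.14e-3    # assumed air density, g/cm^3
C_SOUND = 3.5e4  # assumed sound velocity, cm/s


def chain_d(areas, section_length, freq, rho=RHO, c=C_SOUND):
    """Lower-right entry D of the glottis-to-lips chain matrix.

    With zero lip pressure, U_glottis = D * U_lips.
    """
    k = 2.0 * math.pi * freq / c
    cs, sn = math.cos(k * section_length), math.sin(k * section_length)
    m = [[1.0, 0.0], [0.0, 1.0]]
    for a in areas:                      # sections ordered glottis to lips
        z = rho * c / a                  # characteristic impedance
        s = [[cs, 1j * z * sn], [1j * sn / z, cs]]
        m = [[m[0][0] * s[0][0] + m[0][1] * s[1][0],
              m[0][0] * s[0][1] + m[0][1] * s[1][1]],
             [m[1][0] * s[0][0] + m[1][1] * s[1][0],
              m[1][0] * s[0][1] + m[1][1] * s[1][1]]]
    return m[1][1]


def mirrored_inverse(areas):
    """Discrete counterpart of 1/A(L-d): reverse the order, invert areas."""
    return [1.0 / a for a in reversed(areas)]
```

Evaluating `chain_d` for an area function and for its `mirrored_inverse` at any frequency should give the same value, which is the two-point "fiber" described in the text.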
Other problems in the inverse transformation are related
to the limitations of the model, which disregards the side
branching introduced by the nasal cavity (for nasals and
nasalized sounds), the effect of the vocal tract losses, and
the variability of the source characteristics.

Two  basic  approaches  have  been  used  to  carry  out  the 
inverse  mapping:  linear  prediction  and  terminal  impedance. 
The  first  approach,  linear  prediction/acoustic  tube  model 
(LPAT  model),  developed  by  Wakita  (1973,  1979),  uses  the  known 
relationship between the LPC partial correlation coefficients
k_i (PARCOR) and the reflection coefficients r_i of the lossless
acoustic tube (Rabiner and Schafer, 1978).
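Once reflection coefficients are available, the area function follows by a simple recursion. The sketch assumes the convention r_i = (A_{i+1} - A_i) / (A_{i+1} + A_i); the relation between r_i and the PARCOR coefficients k_i (equal or opposite in sign) depends on the definition adopted and is not fixed here. Only area ratios are determined, so the first area is an arbitrary scale factor.

```python
def areas_from_reflection(refl, first_area=1.0):
    """Rebuild a stepwise area function from reflection coefficients.

    Uses A[i+1] = A[i] * (1 + r[i]) / (1 - r[i]), which inverts
    r[i] = (A[i+1] - A[i]) / (A[i+1] + A[i]); requires |r[i]| < 1.
    """
    areas = [first_area]
    for r in refl:
        areas.append(areas[-1] * (1.0 + r) / (1.0 - r))
    return areas
```

For instance, the coefficients 0.5 and -0.5 describe a tube whose area triples at the first junction and returns to its initial value at the second.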

The other usual approach consists of deriving the area
function from the poles and zeros of the terminal impedance of
the vocal tract (driving point impulse response). Sondhi's
method (Lip Impulse Response) requires an impedance tube that
must be gripped by the subject's lips, which are thus kept
permanently closed (Sondhi and Gopinath, 1971; Sondhi, 1979;
Sondhi and Resnick, 1983). The method of Wakita and Gray (1975)
tries to estimate the lip impedance zeros and poles directly from
speech and then, noninteractively, derives the area function
from these zero and pole frequencies.

Some variants of the basic approaches and other methods
have been tried in order that the inverse transform could be
used as a reliable tool to estimate the area functions from
the acoustic measurements. Ladefoged et al. (1978) obtained
two tongue parameters plus lip rounding from the first three
formants, using the factor analysis of tongue tract shapes by
Harshman et al. (1977). The numerical method of Atal et al.
(1978) is based on a computer sorting and table look-up
procedure (piecewise linear approximation to the
acoustic-to-articulatory function). Milenkovic and Muller's approach
(1984) searches for uniqueness in Wakita's method by starting
from the low order pole and zero frequencies of the vocal
tract driving point impedance. This method requires the use
of a throat accelerometer, a free field microphone and a
spectrum analyzer but does not need the previous determination
of the vocal tract length. It is restricted to the
reconstruction of sustained vowels. A recent approach is to
use an artificial neural network, through a supervised
learning model, in order to obtain the initial articulatory
parameters from the acoustic signal (Xue et al., 1990).

Acoustical pulse reflection. This technique, largely
employed in Geophysics, consists of the evaluation of the
geometry of a nonuniform acoustic medium by using acoustic
reflection. Milenkovic (1987) describes a time-domain method
for inferring the shape of the vocal tract driven by noncausal
excitation. The reconstruction of the area requires the
measurement and analysis of the incident and the reflected
wave components, entering and leaving the acoustic tube,
respectively.
Feedback systems. These approaches try to optimize the
values of the cross-sectional area by using a distance measure
between features of the original utterance and those of the
synthesized speech, and a strategy for adaptation. At
present, neither a computationally efficient scheme nor a
standard procedure is available. The preservation of the
spectral properties can be evaluated by comparing the
spectrograms for the original and synthesized signal.
Flanagan et al. (1975) used the minimum squared error between
the synthesized spectrum and the original spectrum as the
distance for adaptation. Later, the same
researchers (Flanagan et al., 1980) used the logarithm of the
magnitude of the Fourier Transform for both synthetic and
original signals to adapt the source and tract parameters.
For the cord tension parameter, they used the time difference
between the maxima of the cepstrum of the synthetic and
original signal. Levinson and Schmidt (1983) presented an
adaptive computation of the articulatory parameters, using the
minimum squared error between the spectra. Schroeter et al.
(1987) used a distance that is affected only by the variations
of the vocal tract shape. Fant (1985) came up with an
algorithm for deriving the log-magnitude transfer function
directly from the area function, enabling a faster feedback
for the trial-and-error process.
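A distance of the kind used in these feedback schemes can be sketched as a mean squared difference between log-magnitude spectra. This is a generic illustration in the spirit of the measures cited above, not a reproduction of any of them: the direct DFT, the absence of windowing, the magnitude floor and the function names are all assumptions. Both frames must have the same length.

```python
import cmath
import math


def log_magnitude_spectrum(x, floor=1e-12):
    """Natural-log magnitude of the DFT of x for bins 0 .. N/2."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(math.log(max(abs(acc), floor)))  # avoid log(0)
    return spectrum


def log_spectral_distance(original, synthetic):
    """Mean squared difference between two log-magnitude spectra."""
    a = log_magnitude_spectrum(original)
    b = log_magnitude_spectrum(synthetic)
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
```

An adaptation strategy would perturb the articulatory parameters, resynthesize the frame and keep the change whenever this distance decreases.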
1.5 Applications of Articulatory Synthesizers

Speech applications are motivated by three major
concerns:

(1)  To  provide  a natural,  comfortable  and  reliable 
instrument  for  man-machine  direct  interaction. 

(2)  To  aid  the  vocally  or  visually  handicapped. 

(3)  To  improve  voice  communication  systems. 

One  goal  of  a man-machine  interaction  system  is  to 
implement  a machine  capable  of  "understanding"  voice  commands, 
"executing"  them  and  finally  "voice-responding."  Voice 
response  has  already  been  achieved  in  systems  with  typed-text 
input.  Text-to-speech  systems  can  be  considered  at  the 
present  time  as  the  most  successful  and  representative 
application  of  speech  synthesis  (Klatt,  1987) . 

Some support for social needs is also provided by
text-to-speech systems: talking aids for the vocally impaired,
reading aids for the visually handicapped, training aids, etc.
(Klatt, 1987). In this context, an important application to
be pointed out is the modeling of vocal disorders, aimed at
improving patient care (Childers, 1988).

The applications toward the improvement of voice
communication systems are related to coding, transmission, storage
and encryption of speech signals.

How do the articulatory synthesizers fit in this context?
Being closely related to the human vocal system, the
articulatory synthesizers are potentially capable of
producing, at low bit rates, synthetic speech that sounds
more natural and has higher quality than that provided by
formant or LPC synthesizers. Moreover, the control signals of
articulatory synthesizers are promising candidates for
achieving low bit rate coding of speech and their interpolated
values are always physically realizable (Sondhi and
Schroeter, 1987). In speech recognition, the difficulties
associated with the coarticulation compensation and speaker
adaptation are expected to be overcome if the articulatory
movements can be reliably estimated (Shirai and Kobayashi,
1985; Kobayashi et al., 1991).

The  parameters  of  the  LPC  synthesizers  are  not  related  to 
the  physiology  of  speech  production.  The  all-pole  model  makes 
it  difficult  to  generate  nasals,  stops  and  fricatives. 

Automatic formant tracking, mainly for the female voice,
is a difficulty that formant synthesizers still face, along
with the quality of synthetic nasals and fricatives.
Source-tract interaction is often disregarded in both LPC and formant
models.

In contrast, articulatory synthesizers, although not
yet suitable for commercial applications because of the
great deal of computation required and the lack of
sufficient data, can properly deal with these issues. In
fact, it is expected that articulatory synthesizers will turn
out to be an important tool for gaining insight into the key
factors that affect the quality and naturalness of synthetic
speech.


1.6 Research Goals

Despite their great potential, articulatory synthesizers
still need development on several points. First, they are
related to an acoustic theory based primarily on the physiology
of the human vocal system. The vocal system is a nonuniform,
slowly time-varying, lossy acoustic tube. A model capable of
reflecting completely the laws of physics (mass, momentum and
energy conservation, fluid mechanics and thermodynamics) that
govern speech production is extremely difficult and complex
(Rabiner and Schafer, 1978).

Additionally,  the  tradeoff  between  the  "sophistication  of 
the  model"  and  the  "computational  efficiency"  must  be 
carefully  considered. 

The  most  significant  difficulty  to  overcome  is  the  lack 
of  data  and  knowledge  about  the  articulatory  movements  and 
vocal  tract  area  function.  The  traditional  methods  based  on 
cineradiography  are  limited  by  the  long  processing  and  mainly 
by  the  danger  of  overexposure  to  radiation  (Perkell,  1969; 
Kiritani, 1986). The analytical approaches have to face the
nonuniqueness  of  the  mapping  from  speech  features  to  vocal 
tract  parameters  (Charpentier,  1984;  Atal,  1978) . 
Some issues that require more investigation are:

Glottal  excitation.  The  extraction  of  glottal  parameters 
is  a fundamental  issue.  The  glottal  excitation  waveform  is 
closely  related  to  the  naturalness  of  voice  and  the  source- 
tract  interaction  is  a major  factor.  Some  points  to  be 
considered  are  the  effect  of  adopting  a simplified  glottal 
model in terms of the compromise "computer time versus quality
of speech," the vertical phasing of the vocal fold vibration,
the  optimum  number  of  parameters,  the  supraglottal  loading  for 
interactive  models,  the  effect  of  the  subglottal  system,  the 
different  excitations  for  male  and  female  voice,  pathological 
voice  excitations,  etc. 

Vocal  and  nasal  tract.  To  constitute  a reliable  set  of 
articulatory  codebooks  from  the  original  speech  features  is  a 
great  challenge  to  researchers.  In  a target-oriented 
synthesizer,  the  question  that  arises  is  which  interpolation 
approach  provides  the  proper  derivation  of  dynamic  transitions 
between  targets.  The  optimum  number  of  sections  for  the 
stepwise  approximation  of  the  cross  sectional  area  and  the 
values  of  the  lumped  electrical  components  are  not  yet  well 
defined (Wakita and Fant, 1978). The best placement for the
noise  sources  that  simulate  fricatives  and  plosives,  the 
effect  of  the  sinus  cavities,  the  effect  of  the  radiation 
through  the  walls,  and  the  source-tract  interaction  also  need 
more  investigation. 
Articulatory synthesizer mathematical approach. The
tradeoff  of  "quality  versus  computational  efficiency"  for  the 
major  approaches  (time-domain,  wave  digital  filter,  hybrid 
time-frequency  domain,  and  frequency  domain)  needs  to  be 
better  evaluated.  The  analog  models,  for  simplicity's  sake, 
normally  neglect  the  viscous  friction  and  heat-conduction 
losses.  A question  then  arises:  can  the  efficient  digital 
resonators  provide  a reasonable  "degree"  of  speech  quality  if 
excited  by  a proper  waveform,  even  considering  a lossless 
vocal  tract?  Then,  how  far  is  this  level  of  quality  from 
that  of  the  analog  model? 

To  tackle  all  these  issues  in  the  same  dissertation 
would  be  too  ambitious.  Therefore,  this  work  will  follow  the 
priorities  settled  by  the  research  plan  for  speech  synthesis 
in the "Mind-Machine Interaction Research Center," shown in
Figure  1.36.  Double-line  blocks  refer  to  the  objectives  of 
the  present  effort. 

The  goals  were  established  as: 

First goal: To implement the "Articulatory Model" as an
interactive graphic editor. The articulatory model generates
the area function for the vocal tract and also can feed the
formant synthesizer with the formant values (Fig. 1.36). The
expected results must fulfill the following specifications:

(1) Implementation for workstations and personal
computers.
(2) Interaction and flexibility: Each articulatory
parameter  must  be  able  to  be  entered  interactively  (using 
keyboard  or  mouse)  or  the  entire  configuration  may  be  provided 
by  table-look-up.  The  system  must  be  menu-driven  and  provide 
on-line  help.  The  researcher  may  easily  alter  the  position  of 
any  articulator. 

(3)  Drawing  and  editing  aids:  the  system  must  provide 
various  drawing  and  editing  tools,  e.g.,  windowing,  zoom, 
scaling,  rotation,  cross-hair  coordinates,  interpolation, 
colors,  fonts,  measuring  of  distances  and  angles,  etc. 

(4) Display: The realization must be able to display
tridimensional area functions corresponding to target
configurations, their related contour of formants (first to
fourth), the position of the constriction and cross-sectional
area.

(5) Files: The system must be able to import and export
files.

Second  goal:  To  develop  the  acoustical  model,  formed  by 
the  glottal  source,  vocal  and  nasal  tract,  and  radiation. 
Guided  by  the  major  purpose  of  achieving  high-quality  and 
natural-sounding  synthesized  speech  and  by  the  importance  of 
experiments  in  voice  conversion  and  vocal  disorders,  the 
natural  choice  is  the  time-domain  articulatory  synthesizer 
(see discussion on page 21, Section 1.3.3). The glottal
source model must provide good quality for the synthetic
speech combined with a low computational burden. A modified
two-mass model is the initial hypothesis, since preliminary
experiments reveal that it is convenient for simulating an
abnormal vocal fold vibration.

For the tracts, the research aims to achieve a robust
algorithm for solving the difference equations that describe
the model. The hypotheses to be tested are the techniques of
"computer-aided analysis of electronic circuits" associated
with the trapezoidal algorithm. The study of source-tract
interaction and of the effects of neglecting losses will be
byproducts.
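To make the trapezoidal algorithm concrete, the sketch below applies it to a single series R-L branch, L di/dt + R i = v, which is far simpler than the vocal tract network and is used here only as an assumed toy circuit. Averaging both sides over one step h gives the difference equation i[n+1] = ((2L/h - R) i[n] + v[n] + v[n+1]) / (2L/h + R).

```python
def trapezoidal_rl(v, r, l, h, i0=0.0):
    """Integrate L di/dt + R i = v with the trapezoidal rule.

    v : samples of the driving voltage, spaced h seconds apart
    Returns the current samples, starting from the initial value i0.
    """
    g = 2.0 * l / h  # companion conductance-like term of the inductor
    i = [i0]
    for n in range(len(v) - 1):
        i.append(((g - r) * i[-1] + v[n] + v[n + 1]) / (g + r))
    return i
```

Because the update factor (2L/h - R) / (2L/h + R) has magnitude below one for any positive step, the recursion remains stable even for large h, which is the property that motivates implicit methods for stiff circuit equations.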

Third goal: Inverse mapping. The proposed scheme is a
multidimensional optimization technique, using gradient search
and linear successive approximation. The hypothesis is that
these techniques can address the acoustic-to-articulatory
transformation issue with reasonable results, providing
reliable articulatory configurations. It is expected that the
formant contour provided by the formant synthesizer and the
speech-derived formant frequencies can generate an adequate
objective function for the optimization procedure. Rules,
conditions and restrictions will be pursued.
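An objective function built from the two sets of formants could take the following form. The abstract of this work describes a least-absolute-value error over the first three formants; the normalization by the speech-derived value used below is an assumption made for the example, as is the function name.

```python
def formant_error(model_formants, speech_formants):
    """Sum of relative absolute errors between paired formants (Hz)."""
    return sum(abs(m - s) / s
               for m, s in zip(model_formants, speech_formants))
```

For example, a model that yields 550, 1500 and 2500 Hz against speech-derived values of 500, 1500 and 2500 Hz would score 0.1 under this measure, and a perfect match scores zero.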

Fourth goal: Experiments. The articulatory synthesizer
will  be  used  as  a tool  for  two  experiments,  aimed  at  finding 
cues  and  correlates  between  the  control  parameters  and  the 
particular  modified  voice. 
Figure 1.36 Research in Speech Synthesis in the Mind-Machine
Interaction Research Center.
Experiment 1, male/female voice conversion (Pinto et al.,
1989), will consist basically of the assessment of the effects
of varying several parameters on the synthesis of a target
voice:

(1) Variations of the glottal parameters: fundamental
frequency F0 (high F0 is considered the strongest cue of
female voice; Klatt, 1987), glottal area function (female
glottal area is more "symmetrical"; Kitzing and Sonesson,
1974), rest area (female voices are supposed to be
"breathier"), dimensions of vocal folds (lengths for female
are shorter; Borden and Harris, 1984), etc.

(2)  Changes  in  the  vocal  tract  area  function,  mainly  in 
the  pharyngeal  region. 

(3) Changes in the vocal tract length (smaller for female;
Fant, 1973).

(4) Insertion of turbulent noise at the glottis to
simulate breathiness, a correlate of female voices (Klatt,
1987).

Experiment 2, simulation of pathological voices (Lee and
Childers, 1989), will try to evaluate the extent to which a
control parameter can affect a modal voice, converting it to
an abnormal one. The pathology is related mainly to the
abnormal vocal fold vibration (Chang, 1989), but can also
originate from problems in the tracts (blockages, bad position
of articulators, cleft palate, etc.). The final objective in
the MMIRC is to quantify and classify the vocal disorders
caused by variations in the glottis and in the tracts
(Childers, 1988).

The  basic  tasks  to  be  tested  in  the  conversion  to  creaky 
voice  (vocal  fry)  are  the  reduction  of  the  fundamental 
frequency,  the  shaping  of  the  glottal  area  (skewed  to  the 
right,  with  low  open  quotient,  and  with  an  abrupt  closure), 
and the reduction of the intensity contour (Pinto et al.,
1989).

The  experiments  will  be  conducted  on  a one-parameter 
basis  and  then  with  a proper  association.  The  evaluation  will 
be  performed  by  expert  listeners. 

The  sequence  of  steps  of  the  research  plan  is  given  in 
Fig. 1.37.  The  main  concern  in  "Step  1"  is  the  implementation 
of  a basic  tool;  in  "Step  2"  and  "Step  3,"  the  achievement  of 
a high-quality  synthesis,  and  in  "Step  4,"  the  validation  of 
the  synthesizer. 

Therefore,  in  short,  the  purpose  of  this  research  effort 
is  to  achieve  an  articulatory  synthesizer  which,  besides  being 
flexible  and  robust,  can  provide  the  highest  quality  possible 
for  reasonable  computational  efficiency. 

1.7 Description of Chapters

Chapter 2, Articulatory Synthesizer Model, describes the
development of the articulatory model and of the acoustic
model.
Chapter 3, The Inverse Mapping, is concerned with the
derivation  of  the  vocal  tract  area  function  from  the  original 
speech  and  with  the  schemes  for  achieving  the  optimization. 

Chapter  4,  Validation  and  Experiments,  reports  the 
results  of  the  experiments  in  simulation  of  pathological 
voices  and  male/female  conversion. 

Finally,  Chapter  5,  Conclusions  and  Suggestions,  presents 
the  results,  the  accomplished  contributions  and  suggests  other 
topics  for  further  investigation. 
Figure 1.37 Plan of Research: Sequence of Steps.


CHAPTER  2 

ARTICULATORY  SYNTHESIZER  MODEL 
2.1 Introduction

This  chapter  describes  the  articulatory  model 
implementation  and  presents  the  realizations  for  the  glottal 
source,  the  vocal  tract,  the  nasal  tract,  and  the  radiation 
models. The articulatory model presented here is based on
Mermelstein's model and takes advantage of computer-aided
design techniques and LISP routines. The model can work
either on an interactive basis or in a single step mode. The
acoustic  model  is  implemented  in  the  time-domain,  not  only  to 
provide  the  proposed  experiments  with  the  proper  variables  but 
also  to  improve  the  quality  of  the  synthesized  speech. 

2.2 Articulatory Model

2.2.1  Articulators 

The  articulatory  model  estimates  the  cross-sectional  area 
of  the  vocal  tract  for  each  configuration  shaped  by  the 
articulators.  Each  configuration  represents  a target  for  the 
articulatory  synthesizer  and  intermediate  positions  are 
estimated by interpolation. The articulatory model
realization is based on Mermelstein's (1973) model, because
that model achieved a good match between the model-generated
midsagittal vocal-tract outline and measurements from X-ray
tracings.

The  articulatory  parameters  are  (Fig.  2.1): 

Jaw. It is represented by the point J whose polar
coordinates with respect to the reference point F are (SJ,
TETJ). For most phonemes the distance SJ is kept constant.

Tongue body. It is represented by a circular arc D-B
whose center TONGC has polar coordinates (SC, TETJ+TETC) with
respect to the point F. The radius of the arc is constant.
For the inverse mapping (Chapter 3) the rectangular coordinates
(TOX, TOY) are used.

Tongue blade. It is represented by the arc B-T. The
location of B depends on the tongue-body center and on the jaw
angle. The point T, the tongue tip, is specified by the
rectangular coordinates (TX, TY). For vowels, the length of
B-T can be considered constant and its angle with respect to the
horizontal can be estimated from the jaw and tongue-body
coordinates.

Hyoid. The point H represents the intersection of the
anterior edge of the epiglottis with the top edge of the hyoid
bone, and the point K represents the anterior extremity of the
larynx. The point P is on the normal bisector of the segment
H-D, tangent to the arc D-B. The hyoid is specified by the
parameter HY, the distance from P to the segment H-D. The
hyoid and the tongue body determine the anterior shape of the
pharynx H-P-D, represented by the straight line D-P and the
arc P-H in this implementation.

Figure 2.1 Articulatory model parameters.

Velum. The point V, whose rectangular coordinates
are (VX, VY), defines the position of the tip of the uvula.

Lips. The lower lip is represented by the point L7,
whose coordinates HL and PPL are specified with respect to the
top of the lower incisor J, which, in turn, depends on the jaw
position. The upper lip position L5 has the same coordinate
values, with respect to the bottom of the upper incisors U.
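The polar specification of the jaw and of the tongue-body center can be illustrated with a short sketch. It assumes angles in radians measured counterclockwise from the horizontal axis through F; the reference direction used in the actual model may differ, and the function names are hypothetical.

```python
import math

def polar_to_xy(origin, radius, angle_rad):
    """Convert a polar pair (radius, angle) relative to `origin` into (x, y).

    Assumption: the angle is measured counterclockwise from the +x axis.
    """
    ox, oy = origin
    return (ox + radius * math.cos(angle_rad),
            oy + radius * math.sin(angle_rad))

def jaw_and_tongue_center(F, SJ, TETJ, SC, TETC):
    """Return the jaw point J and the tongue-body center TONGC.

    J has polar coordinates (SJ, TETJ) about F; TONGC has polar
    coordinates (SC, TETJ + TETC) about F, so the tongue body
    rotates together with the jaw.
    """
    J = polar_to_xy(F, SJ, TETJ)
    TONGC = polar_to_xy(F, SC, TETJ + TETC)
    return J, TONGC
```

Because TETC is added to TETJ, a change in jaw angle alone moves the tongue-body center as well, which is why the rectangular pair (TOX, TOY) is a convenient alternative when the two must be varied independently.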

2.2.2  Area  Function  Estimation. 

A grid system intercepts the posterior-superior and the
anterior-inferior outlines of the vocal tract, determining the
midsagittal distances g_j. For each region of the vocal tract
an empirical function f(j, g_j) maps the midsagittal distances to
cross-sectional areas (Heinz and Stevens, 1965; Ladefoged et
al., 1971; Mermelstein et al., 1971). A correction factor Cf
must be used in each mapping whenever the direction of the
wave propagation is not normal to the midsagittal segment g_j.

2.2.3  Implementation 

The articulatory model was implemented using CAD
techniques and LISP routines. This approach provided several
advantages: high speed, full automation, excellent
visualization, program compactness, accuracy, and flexibility.
The articulatory parameters can be entered step by step
(interactive graphic editor and mapper), using the cross-hair
cursor or the keyboard, or the model can be provided with all
the necessary parameters in a file, in which case the area
function and the formant values are calculated. This latter
feature will be presented in detail in Section 3.3.4, as part
of the implementation of the inverse mapping. Default values
are used, when known. Values stored in tables can also be
selected at any step. All the CAD utilities are available at
any time: erasing, drawing geometric entities, measuring
distances and angles, extracting coordinates, inserting text,
storing blocks of drawings, zooming, printout, coloring,
layering, etc.

The  sequence  for  constructing  the  model  interactively  is 
given  in  Fig.  2.2.  The  macro  commands  are: 

[BEGIN]. The macro prompts the user for the name, the
duration and the reference number (used for filling in the
table of formants) of the target phoneme or configuration.

[FIXED-ST]. The fixed structure of the vocal tract
outline is inserted: periarytenoid G, rear pharyngeal wall
W-G2, G2-G and G-G1, anterior-inferior pharyngeal wall H1-H2-K,
highest point on the maxilla M, upper incisors U, hard palate
M-N, and segment N-U (Fig. 2.3). The area function of the
lower pharynx (up to the point H1) is calculated. Points for
the sagittal grid lines are determined on the arc M-N and on
the line N-U. The length and the width of the lower part of
the pharynx may be easily modified, to accommodate all the
different configurations of phonemes and to allow the
modelling of female voices.

Figure 2.2 Sequence of commands for the articulatory model.

    Input                              Macro         Result
    phoneme name and duration, ref #   [BEGIN]
                                       [FIXED-ST]    fixed structure is inserted
    jaw J                              [JAW]
    velum V                            [VELUM]
    tongue body center TONGC           [TONGBODY]
    hyoid position (or default)        [HYOID]       anterior outline of pharynx completed
    tongue tip T (or default)          [TONGBLADE]
    lip height HL and protrusion PPL   [LIPS]        all outline completed
                                       [SAG. GRID]   sagittal grid plot
                                       [AREA FCN]    area function plot; external file
                                       [FORMANTS]    formant points; constriction and NT; external file

    The sequence is repeated for each target; with 6 targets, the
    formant contour is obtained by interpolation.

[JAW]. The researcher is prompted for the jaw position
J. An arc appears on the screen with a range of possible
values.

[VELUM]. The researcher enters the position of the velum
V, answering the prompt. The circular arc V-M, which
represents the soft palate, is displayed on the screen.

[TONGBODY]. A circle that represents the initial tongue-
body outline is moved by the researcher to the desired
position. The researcher enters the tongue-body center TONGC.
This procedure is helpful for locating the constriction caused
by the tongue body, in the case of vowels. A circular arc is
displayed to represent the tongue body (Fig. 2.3). A radius of
1.8 cm gives the best fit for most of the phonemes.

[HYOID]. The researcher is prompted for the point P,
which determines the tongue-body-hyoid line offset. The
possible excursion of P along a straight line is displayed
(Fig. 2.4). An approximate default position, derived from the
value of the tongue-body-hyoid distance D-H, can be optionally
selected. The anterior outline of the pharynx is then
displayed (arc H-P, and the line P-D).

[TONGBLADE]. The researcher is prompted for the tongue
tip T. In the case of vowels, a default position derived from
the jaw and tongue-body coordinates can be selected. The
inferior outline of the vocal tract is completed, except for
the lips. This implementation models the region between the
tongue tip and the jaw with the arc T-C, and the lines C-E and
E-J (Fig. 2.5). This procedure determines the sagittal
distances quite accurately, even when the tongue tip is near
or beyond the jaw.

Figure 2.3 Interactive articulatory model.
a) fixed structure and velum phases.
b) tongue-body phase.

Figure 2.4 Interactive articulatory model: hyoid phase.
a) choice of a point on line HY, or default or previous values.
b) anterior-inferior pharyngeal wall H-P-D completed.

Figure 2.5 Interactive articulatory model: tongue-blade phase.
a) choice of point, or default or previous values.
b) tongue blade B-T and region between tongue tip and jaw
T-C-E-J completed.

[LIPS]. The researcher enters the lip openness (Fig. 2.6)
and lip protrusion on auxiliary lines and then the lips are
drawn.

[SAG. GRID]. This macro selects proper points on the
anterior-inferior and posterior-superior outlines and draws
the sagittal lines between them. This grid is not fixed, that
is, it varies according to the position of the articulators,
so as to keep the lines "almost perpendicular" to the tract
walls (Fig. 2.7). This feature provides more reliable
sagittal distances. If necessary, final adjustments can be
made by using the macro [MODIF].

[AREA FCN]. The sagittal distances and their
corresponding cross-sectional areas are determined. The area
function is plotted on the screen, using its particular color
and layer, and it is written into an external file. The
expected phoneme durations label the time axis for the
three-dimensional representation (Fig. 2.8). The articulatory
vector corresponding to the latest determined area function is
written on the screen too. The distances between the
midpoints of two consecutive sagittal lines represent the
lengths of the concatenated tubes.
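The geometric step described for [AREA FCN] can be sketched as follows. This is a minimal illustration, not the original LISP routine: it assumes each sagittal line is given by its two end points (one on each outline), and the function name and argument layout are hypothetical.

```python
import numpy as np

def sagittal_distances_and_tube_lengths(anterior_pts, posterior_pts):
    """Sagittal distances and tube lengths from the grid end points.

    anterior_pts[i] and posterior_pts[i] are the (x, y) end points of
    sagittal line i on the anterior-inferior and posterior-superior
    outlines. The sagittal distance is the length of each line; the tube
    length is the distance between midpoints of consecutive lines.
    """
    a = np.asarray(anterior_pts, dtype=float)
    p = np.asarray(posterior_pts, dtype=float)
    sagittal = np.linalg.norm(a - p, axis=1)
    midpoints = 0.5 * (a + p)
    lengths = np.linalg.norm(np.diff(midpoints, axis=0), axis=1)
    return sagittal, lengths
```

With n sagittal lines this yields n sagittal distances but only n-1 tube lengths, so an implementation has to decide how the end sections are treated; that choice is not stated in the text.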
Figure 2.6 Interactive articulatory model: lip phase.
a) lip height: choice of point on line h or previous value.
b) lip protrusion: choice of point on line p or previous value.

Figure 2.7 Sagittal grids for two different configurations.
[FORMANTS]. The first four formants are derived from the
area function. A table is inserted and filled in with the
name of the phoneme and its duration, the constriction area
and its distance to the glottis, the section (for equal-length
tubes) where the nasal tract begins, and the values of the
formants (Fig. 2.8). By interpolating the original area
function, an equal-length-tube area function is obtained
(keeping the original values of the reflection coefficients).
These data are fed to the articulatory synthesizer. The
target positions of the formants are plotted in the same color
as the corresponding area function.

[INTERP]. After obtaining the formants for a set of six
(default value) target configurations, the formant contours
are drawn, using interpolation. The formant values at any
time can be determined by using the cross-hair cursor (Fig.
2.8).
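A formant contour between targets can be sketched with piecewise-linear interpolation. The text does not say which interpolation rule [INTERP] uses, so linear interpolation is an assumption here, as are the function name and the idea that each target is placed at a known time instant.

```python
import numpy as np

def formant_contours(target_times, target_formants, query_times):
    """Interpolate each formant linearly between target configurations.

    target_times:    increasing time instants of the targets (s)
    target_formants: array of shape (n_targets, n_formants), in Hz
    query_times:     instants at which the contour is wanted
    Returns an array of shape (len(query_times), n_formants).
    """
    tt = np.asarray(target_times, dtype=float)
    ff = np.asarray(target_formants, dtype=float)
    q = np.asarray(query_times, dtype=float)
    columns = [np.interp(q, tt, ff[:, k]) for k in range(ff.shape[1])]
    return np.column_stack(columns)
```

Reading a value off the contour at an arbitrary time, as done with the cross-hair cursor, corresponds to calling the function with a single query time.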

Some features of this implementation are:

(1) The model can deal with both vowels and consonants.
The tongue tip can be placed at any position, for example,
between the lips (Fig. 2.9 and Fig. 2.10).

(2) The fixed structure can be easily modified to
simulate the vocal tract of females or children (Fig. 2.9 and
Fig. 2.10).

(3) Defaults are provided for the hyoid and tongue tip.

(4) The anterior outline of the pharynx H-P and the
region under the tongue tip T-C are modelled by arcs. This
model has led to good results.

(5) The "grid" varies according to the position of
the articulators, enabling better measurements of the sagittal
distances.

(6) Each area function lies in a different layer, so
that any combination of curves can be displayed or stored.

(7) The name of the target, the corresponding number of
the articulatory vector, area function, and formant points are
represented with the same color.

(8) At any step, if necessary, the researcher can make
adjustments or modifications.

(9) On-line help is available for the utilities and for
the articulatory data (range of values, rules, etc.).

Figure 2.8 Final screen for the mapping of six phonemes.

Figure 2.9 Modelling stops in a female vocal tract.
a) bilabial. b) apico-alveolar.

Figure 2.10 Modelling stops in a female vocal tract.
c) prevelar. d) velar.

These features have made the articulatory model efficient
and flexible. As we will report in Chapter 3, this
implementation is capable of generating initial configurations
that lead to fast convergence and proper articulatory
dynamics in the optimization scheme.

2.3 Acoustic Model
2.3.1  Glottal  Excitation 

As mentioned in Chapter 1, Section 1.4.1.4, the glottal
excitation model is responsible for the quality of the
synthesized voice (Childers et al., 1983). Therefore,
to justify the adoption of a model capable of providing high-
quality synthetic speech and simulating abnormal vocal fold
vibration, a closer look into the kinematics of the vocal
folds and glottal modelling will be taken.

2.3.1.1 Vocal fold vibration

Figure 2.11 sketches the important regions of the human
larynx. The larynx is a cartilaginous tube covered by a
mucous membrane (Moore, 1971). There are two sets of folds:
the "true vocal folds" (or simply, vocal folds) and the "false
vocal folds" (or ventricular folds). The laryngeal ventricle
is the region between the "false" and the "true" folds and the
glottis is the space between the "true" folds. A detailed
description of the laryngeal cartilages and of the
muscles responsible for the vocal fold vibration is given by
Sorokin (1985) and Chan (1989). The intrinsic muscles control
the longitudinal tension of the vocal folds (tensor and
relaxer muscles) and the opening (abductor muscle) and closing
(adductor muscle) of the glottis. The most important
conclusions (Titze and Talkin, 1979) about the vibratory
features of the vocal folds are:

(1) The vibratory modes depend on the rest area (or pre-
phonatory area) of vocal fold opening (Fig. 1.24). The
vertical pre-phonatory shape of the glottis controls the
human phonation and affects the glottal waveform (Titze,
1988). Thus, the initial level of abduction is a "critical
factor in establishing self-oscillation" (Ishizaka and
Flanagan, 1972, p.1233).

Figure 2.11 Frontal view of human larynx (Moore, 1971).

Figure 2.12 Frontal cross-sectional sketch of the internal
tissue structure of the left vocal fold (Chan, 1989).
Labels: ligament, mucosa, muscularis vocalis, thyroid cartilage.

(2)  The  vocal  folds  are  composed  of  three  important 
elements,  in  terms  of  phonation  modelling:  the  vocalis  muscle, 
the mucosa (mucous membrane) and the vocal ligament (Titze,
1975). There are two significant lags between movements of
the  folds:  the  vertical  lag  (or  vertical  phasing)  and  the 
longitudinal  phasing.  The  vertical  phasing  (mucosa  waving) 
occurs  between  the  movements  of  the  upper  and  lower  portions 
of  the  vocal  folds,  mainly  for  low  pitch  voices.  The 
longitudinal  phasing  occurs  between  the  movements  of  the 
anterior  and  posterior  portions  of  the  vocal  folds. 

(3)  The  maximum  amplitude  of  oscillation  of  the  vocal 
fold  is  another  factor  to  be  considered  in  the  models.  Along 
the  length  of  the  glottis,  the  amplitude  of  the  vibration  is 
maximum  in  the  middle.  Along  the  depth,  the  maximum  amplitude 
is  at  the  inferior  end  (Fig.  2.13). 

2.3.1.2 Two-mass model

The two-mass model is an improvement on the one-mass
model, the first self-oscillating mechanical simulation of the
glottal excitation. The one-mass model due to Flanagan and
Landgraf (1968) considers the vocal folds as a single
mechanical oscillator (mass, spring and non-linear damper).
The opposing masses are permitted to move only laterally as a
whole, that is, the vertical and longitudinal phasing between
the movements of the different regions in the folds are not
considered (Fig. 1.14). The model replicates source-tract
interaction. Even though the features of the vocal folds
cannot be completely duplicated, the one-mass model enables an
acceptable simulation of the glottal area and of the volume-
velocity waveform.

Figure 2.13 Maximum excursion profile of the vocal folds
(Chan, 1989).

Mechanical model. The two-mass model (Ishizaka and
Flanagan, 1972) considers the mucosa and the vocalis-ligament
(upper and lower portions along the depth of the vocal folds)
as two coupled mechanical oscillators (Fig. 2.14). The two
independent masses m1 and m2 can move only laterally and they
interact under the influence of a force that tends to restore
their relative equilibrium position. This is represented by
the linear spring with stiffness kc. The elastic properties
of the folds are represented by the springs s1 and s2, while
the viscous resistances are represented by the damping r1 and
r2.

In order to control the fundamental frequency, Ishizaka
and Flanagan (1972) used the "tension parameter" Q as a
factor to scale down the masses and thickness and scale up the
springs. A fundamental frequency varying almost linearly with
Q could be obtained this way (range of 120 to 220 Hz).
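The near-linear dependence on Q can be motivated with a single undamped mass-spring oscillator: if the mass is divided by Q and the stiffness multiplied by Q, the natural frequency sqrt(k/m)/(2*pi) is multiplied by exactly Q. The sketch below shows only this scaling argument; it is not the coupled two-mass model, and the function names and numerical values are illustrative assumptions.

```python
import math

def apply_tension(mass, thickness, stiffness, Q):
    """Scale mass and thickness down by Q and spring stiffness up by Q."""
    return mass / Q, thickness / Q, stiffness * Q

def natural_frequency_hz(mass, stiffness):
    """Natural frequency of an isolated, undamped mass-spring oscillator."""
    return math.sqrt(stiffness / mass) / (2.0 * math.pi)

# Illustrative values in CGS units (g, cm, dyne/cm); assumptions only.
m0, d0, k0 = 0.125, 0.25, 80000.0
m_q, d_q, k_q = apply_tension(m0, d0, k0, Q=1.5)
ratio = natural_frequency_hz(m_q, k_q) / natural_frequency_hz(m0, k0)
```

In the full model the coupling spring, the damping and the aerodynamic forces also act, which is consistent with the text describing the relation as only "almost" linear.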

Equivalent Electrical Circuit. A complete equivalent
circuit for the two-mass self-oscillating model is shown in
Fig. 2.15. It accounts also for the air volume displaced by
the masses, in the lateral direction x and in the longitudinal
direction y. Flanagan and Ishizaka (1978) concluded that the
contributions of the displacement currents were negligible for
speech synthesis purposes. Therefore, all the shunt elements
can be neglected, resulting in the simpler equivalent circuit
shown in Fig. 2.16. The values of the equivalent electrical
components, given in Fig. 2.16, are derived by applying the
fluid flow theory. The resistances and inductances depend on
the glottal area Ag1 (related to mass m1) or on the glottal
area Ag2 (related to mass m2), which, in turn, depend on the
lateral displacements x1 and x2, respectively.

Figure 2.14 Two-mass model of the vocal folds
(Ishizaka and Flanagan, 1972).

m1 and m2 : masses
d1 and d2 : thickness of m1 and m2, respectively
s1 and s2 : equivalent springs
r1 and r2 : equivalent viscous resistances
x1 and x2 : lateral motions of m1 and m2, respectively
Ag01 and Ag02 : rest areas of m1 and m2, respectively
lg : effective length of the glottal slit
kc : stiffness of linear spring
The figure also marks the contraction and expansion regions of
the glottis and their respective distances.

The lateral displacements x1 and x2 are the solutions of the
equations of motion for the two masses:

$$ m_1 \ddot{x}_1 + r_1 \dot{x}_1 + s_1 + k_c (x_1 - x_2) = F_1 $$
$$ m_2 \ddot{x}_2 + r_2 \dot{x}_2 + s_2 + k_c (x_2 - x_1) = F_2 $$

where:

m1, m2, r1, r2, kc : given in Fig. 2.14;

F1, F2 : forces acting on m1 and m2, respectively (given in
Ishizaka and Flanagan, 1972, Eq. 18);

s1 and s2 : spring restoring forces (given in Ishizaka and
Flanagan, 1972, Eq. 18).

By discretizing the motion equations with backward
differences and delaying the cubic terms by one sample, a
linear system of equations is derived. Once x1 and x2 are
determined, all the elements in the equivalent circuit can be
calculated.
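The discretization just described can be sketched as a single time step that solves a 2x2 linear system. The sketch assumes a spring force of the form s_i = k_i (x_i + eta x_i^3), with the cubic part evaluated at the previous sample, and treats the driving forces as given; the actual force and spring expressions are those of Ishizaka and Flanagan (1972, Eq. 18) and are not reproduced here. All names and numerical values are illustrative assumptions.

```python
import numpy as np

def two_mass_step(x_prev, x_prev2, force, m, r, k, kc, eta, h):
    """Return [x1, x2] at sample n from the samples n-1 and n-2.

    Backward differences replace the derivatives:
        x'  ~ (x[n] - x[n-1]) / h
        x'' ~ (x[n] - 2 x[n-1] + x[n-2]) / h**2
    The cubic spring term uses x[n-1], so x[n] enters only linearly.
    """
    x_prev = np.asarray(x_prev, dtype=float)
    x_prev2 = np.asarray(x_prev2, dtype=float)
    # Left-hand side: inertia, damping, linear springs, coupling spring kc.
    A = (np.diag(m / h**2 + r / h + k)
         + kc * np.array([[1.0, -1.0], [-1.0, 1.0]]))
    # Right-hand side: forces, history terms and the delayed cubic term.
    b = (force + m * (2.0 * x_prev - x_prev2) / h**2
         + r * x_prev / h - k * eta * x_prev**3)
    return np.linalg.solve(A, b)

# Illustrative parameters in CGS units; assumptions, not source values.
m = np.array([0.125, 0.025])      # masses (g)
k = np.array([80000.0, 8000.0])   # linear stiffnesses (dyne/cm)
r = np.array([20.0, 17.0])        # viscous resistances (dyne*s/cm)
kc = 25000.0                      # coupling stiffness (dyne/cm)
h = 1.0 / 20000.0                 # sampling interval (s)

def settle(force, eta=0.0, n_steps=4000):
    """Iterate the step under a constant force, starting from rest."""
    prev = np.zeros(2)
    prev2 = np.zeros(2)
    for _ in range(n_steps):
        current = two_mass_step(prev, prev2, force, m, r, k, kc, eta, h)
        prev2, prev = prev, current
    return prev
```

Under a constant force the iteration settles to the static equilibrium of the spring network, which gives a simple consistency check on the discretized equations. In the synthesizer the forces depend on the glottal pressures and therefore change at every sample.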
Figure 2.15 Complete equivalent circuit for the two-mass
model (Flanagan and Ishizaka, 1978).

Figure 2.16 Simplified equivalent circuit for the two-mass
model (Flanagan and Ishizaka, 1972).

- Rc : abrupt contraction at the inlet to the glottis.
- Rv1 and Rv2 : viscous losses at the lower-fold edge and
  upper-fold edge, respectively.
- R12 : change in kinetic energy per volume of fluid, at
  the junction between masses m1 and m2.
- Re : expansion of the glottal outlet.
- Lc, Lg1 and Lg2 : inertances of the air masses.
Results from Ishizaka and Flanagan's model. Some results
that are relevant for our objective of achieving a glottal
model capable not only of producing high-quality synthesis but
also of performing specific experiments are:

(1) The increase of Q causes an increase in the
fundamental frequency, a reduction in the phase difference
between the cross-sectional glottal areas Ag1 and Ag2, and a
decrease in the amplitude of the oscillation of m1 and m2
(larger decrease for m1). Figure 2.17 shows the glottal areas
and glottal volume velocity for three different values of Q.

(2) The natural volume velocity waveform is left-skewed
due to the vertical phasing and to the different amplitudes of
vibration of m1 and m2.

(3) Beyond a specific rest-area threshold the model does
not maintain the oscillations (about 0.25 cm² in Ishizaka and
Flanagan's model).

(4) The fundamental frequency F0 increases almost
linearly with the increase of the subglottal pressure Ps. The
greater the nonlinear coefficient of stiffness, the greater is
the slope of the F0(Ps) function (Fig. 2.18). The amplitude
of the vibrations also becomes greater as the subglottal
pressure increases.

(5) The open quotient decreases as the subglottal
pressure increases but is almost asymptotic around 0.6 (Fig.
2.19).
Figure 2.17 Glottal areas and volume velocity for three
different values of the tension parameter Q
(Ishizaka and Flanagan, 1972).

Figure 2.18 Effect of the subglottal pressure Ps on the
fundamental frequency, with the nonlinear coefficient of
stiffness as curve parameter (Ishizaka and Flanagan, 1972).

Figure 2.19 Effect of the subglottal pressure Ps on the
duty cycle or open quotient (Ishizaka and
Flanagan, 1972).
2.3.1.3 Proposed models

Introduction. Searching for suitable models, our concern
is to harmonize the following requisites:

(1)  To  keep  the  performance  of  the  two-mass  model  in 
terms  of  quality  of  the  synthetic  speech. 

(2)  To  keep  the  computational  burden  low. 

(3) To be able to perform experiments in modelling voice
conversion and pathological voices, by an adequate control of
the parameters (pitch and glottal events).

Therefore,  two  models  are  proposed. 

First model: parametric 2-mass model. The first model is
based on the two-mass model. Nevertheless, instead of being
derived from the motion equations of the self-oscillating
model (to determine the lateral displacements x1 and x2), the
glottal areas Ag1 and Ag2 are generated by a proper
parameterization. Observations of their waveforms are
available from high-speed laryngeal cinematography, PGG, UGG,
etc. (Chap. 1, Section 1.4.1.4) and from the results of the
two-mass model. For normal voices, they present a phase
difference θ of about 55 degrees and their open quotient Qo
(glottal open time/total period) is about 0.6, according to
Ishizaka and Flanagan (1972). The effects of the rest area
and of the subglottal pressure were mentioned in Section
2.3.1.2. Hence, proper specification of glottal areas can be
made.
Figure 2.20 First model for the glottal source:
parametric 2-mass model.

Series elements of the equivalent circuit: Rk1, Lg1, Rv1, Rv2,
Lg2, Rk2, with

$$ R_{k1} = 0.19\,\rho\,U(1) / A_{g1}^2 $$
$$ R_{v1} = 12\,\mu\,l_g^2\,d_1 / A_{g1}^3 $$
$$ L_{g1} = \rho\,d_1 / A_{g1} $$
$$ R_{k2} = \rho\,[0.5 - (A_{g2}/A_{F1})(1 - A_{g2}/A_{F1})]\,U(1) / A_{g2}^2 $$
$$ R_{v2} = 12\,\mu\,l_g^2\,d_2 / A_{g2}^3 $$
$$ L_{g2} = \rho\,d_2 / A_{g2} $$

ρ : density of air : 1.14x10^-3 g/cm^3
μ : viscosity of air : 1.86x10^-4 dyne*s/cm^2
d1 : thickness of mass m1
d2 : thickness of mass m2; d1 + d2 = 0.3 cm
lg : glottal length : 1.4 cm
U(1) : volume velocity in the first section
AF1 : area of the first section of the vocal tract

Glottal area functions:

$$ A_{g1}(t) = \begin{cases}
A_1 [0.5 - 0.5\cos(\pi t / T_o)] + A_{gmin}, & 0 < t < T_o \\
A_1 \cos[\pi (t - T_o) / 2T_c] + A_{gmin}, & T_o < t < T_o + T_c \\
A_{gmin}, & T_o + T_c < t < T
\end{cases} $$

$$ A_{g2}(t) = \begin{cases}
A_{gmin}, & 0 < t < \tau \ \text{or}\ T_o + T_c + \tau < t < T \\
A_2 \{0.5 - 0.5\cos[\pi (t - \tau) / T_o]\} + A_{gmin}, & \tau < t < T_o + \tau \\
A_2 \cos[\pi (t - T_o - \tau) / 2T_c] + A_{gmin}, & T_o + \tau < t < T_o + T_c + \tau
\end{cases} $$

where:

T : pitch period; Agmin : rest area
To : duration of opening phase = Qo Qm T / (1 + Qm)
Tc : duration of closing phase = Qo T / (1 + Qm)
Qo : open quotient; Qm : speed quotient
A1 + Agmin : peak of glottal area Ag1
A2 + Agmin : peak of glottal area Ag2
Sk = A2 - A1 : steepness constant
τ : time delay between the two glottal areas: τ = θ T / 360
θ : phase difference between Ag1 and Ag2
The adopted waveform for the glottal area functions is a
raised version of the model of Ananthapadmanabha and Fant
(1982):

$$ A_{g1}(t) = \begin{cases}
A_1 [0.5 - 0.5\cos(\pi t / T_o)] + A_{gmin}, & 0 < t < T_o \\
A_1 \cos[\pi (t - T_o) / 2T_c] + A_{gmin}, & T_o < t < T_o + T_c \\
A_{gmin}, & T_o + T_c < t < T
\end{cases} $$

$$ A_{g2}(t) = \begin{cases}
A_{gmin}, & 0 < t < \tau \ \text{or}\ T_o + T_c + \tau < t < T \\
A_2 \{0.5 - 0.5\cos[\pi (t - \tau) / T_o]\} + A_{gmin}, & \tau < t < T_o + \tau \\
A_2 \cos[\pi (t - T_o - \tau) / 2T_c] + A_{gmin}, & T_o + \tau < t < T_o + T_c + \tau
\end{cases} $$

where:

T : pitch period

To : duration of opening phase, To = Qo Qm T / (1 + Qm)

Tc : duration of closing phase, Tc = Qo T / (1 + Qm)

Qo : open quotient

Qm : speed quotient

Agmin : rest area (minimum glottal area)

A1 + Agmin : peak of glottal area Ag1

A2 + Agmin : peak of glottal area Ag2

τ : time delay between the two glottal areas: τ = θ T / 360

θ : phase difference between Ag1 and Ag2

Figure 2.20 illustrates the parametric 2-mass model.
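The parametric area functions can be evaluated directly from the definitions above. The following sketch assumes that t lies within one pitch period, 0 <= t < T, and that Qo + θ/360 <= 1 so that the delayed pulse fits inside the period; the function names are hypothetical.

```python
import numpy as np

def phase_durations(T, Qo, Qm):
    """Opening and closing durations; To + Tc = Qo*T and To/Tc = Qm."""
    To = Qo * Qm * T / (1.0 + Qm)
    Tc = Qo * T / (1.0 + Qm)
    return To, Tc

def raised_pulse(t, amp, To, Tc, Agmin):
    """Raised-cosine opening, quarter-cosine closing, resting at Agmin."""
    t = np.asarray(t, dtype=float)
    opening = amp * (0.5 - 0.5 * np.cos(np.pi * t / To))
    closing = amp * np.cos(np.pi * (t - To) / (2.0 * Tc))
    pulse = np.where((t >= 0.0) & (t <= To), opening,
                     np.where((t > To) & (t <= To + Tc), closing, 0.0))
    return pulse + Agmin

def glottal_areas(t, T, Qo, Qm, A1, A2, Agmin, theta_deg):
    """Return (Ag1, Ag2); Ag2 is the same pulse shape delayed by tau."""
    To, Tc = phase_durations(T, Qo, Qm)
    tau = theta_deg * T / 360.0
    Ag1 = raised_pulse(t, A1, To, Tc, Agmin)
    Ag2 = raised_pulse(np.asarray(t, dtype=float) - tau, A2, To, Tc, Agmin)
    return Ag1, Ag2
```

For a running synthesis the time argument would be taken modulo the current pitch period, so that T, and hence the fundamental frequency contour, can change from cycle to cycle.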
Second model: equivalent glottal area. The second
proposed model combines the simplicity of the one-mass model
with the glottal area function generated by the two-mass
model. The pressure distribution along the glottal flow (Fig.
1.20) is well explained by Ishizaka and Flanagan (1972). From
the relationship between pressure differences and volume
velocity, the acoustic impedance elements (electrical
equivalent circuit) are determined (Fig. 2.16). The pressure
drop corresponding to each component is caused by:

Rc : abrupt contraction at the inlet to the glottis.

Rv1 and Rv2 : viscous losses at the lower-fold edge and
upper-fold edge, respectively.

R12 : change in kinetic energy per volume of fluid, at the
junction between masses m1 and m2.

Re : expansion of the glottal outlet.

Lc, Lg1 and Lg2 : inertances of the air masses.

The total acoustic impedance is given by:

$$ Z_g = \frac{\rho}{2} U_g \left[ \frac{0.37}{A_{g1}^2} + \frac{1 - 2A_{21}(1 - A_{21})}{A_{g2}^2} \right] + (R_{v1} + R_{v2}) + j\omega (L_c + L_{g1} + L_{g2}) \qquad (2.1) $$

where:

A21 = Ag2 / AF1

AF1 : area of the first section of the vocal tract

The real part of the empirical glottal impedance formula
presented by van den Berg et al. (1957) is:

$$ \mathrm{Real}(Z_{vg}) = 0.87\,\frac{\rho}{2}\,\frac{U_g}{A_g^2} + 12.0\,\frac{\mu\,l_g^2\,d}{A_g^3} \qquad (2.2) $$

where:

ρ : density of air
μ : coefficient of viscosity
d : total depth of glottis
lg : length of the glottis




The first term of (2.2) represents the kinetic resistance
and the second term, the viscous resistance. For a
non-uniform glottal width, this second term is not significant
(van den Berg, 1968).

Comparing equations (2.1) and (2.2) and neglecting the
viscous losses, an equivalent glottal area Ageq is derived:

$$ A_{geq} = N_f\,\frac{\{A_{g1}^2 A_{g2}^2\}^{1/2}}{\{0.37 A_{g2}^2 + [1 - 2A_{21}(1 - A_{21})]\,A_{g1}^2\}^{1/2}} \qquad (2.3) $$

where: Nf : normalization factor

Therefore, the glottal kinetic resistance Rkin for a one-
duct representation of the glottis, as a function of the
equivalent glottal area, will be:

$$ R_{kin} = \frac{\rho}{2} U_g \left[ \frac{1.37}{A_{geq}^2} - \frac{2}{A_{geq} A_{F1}} \left( 1 - \frac{A_{geq}}{A_{F1}} \right) \right] \qquad (2.4) $$

Finally, if the viscous resistance and the inertances are
to be considered, assuming a one-mass model, the equivalent
glottal impedance becomes:

$$ Z_{eq} = R_{kin} + R_v + j\omega L_g \qquad (2.5) $$

where:

$$ R_v = 12\,\mu\,l_g^2\,d / A_{geq}^3 $$
$$ L_g = \rho\,d / A_{geq} $$
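Equations (2.3) to (2.5) translate into a few short functions. In this sketch the air constants and the default glottal length and depth are the values listed with Fig. 2.20; the normalization factor Nf is not specified in the text, so it is left as a parameter with a placeholder default of 1.0. Function names are hypothetical.

```python
import math

RHO = 1.14e-3   # density of air, g/cm^3 (value listed with Fig. 2.20)
MU = 1.86e-4    # viscosity of air, dyne*s/cm^2 (value listed with Fig. 2.20)

def equivalent_area(Ag1, Ag2, AF1, Nf=1.0):
    """Equivalent glottal area, Eq. (2.3). Nf=1.0 is a placeholder."""
    A21 = Ag2 / AF1
    c = 1.0 - 2.0 * A21 * (1.0 - A21)
    return Nf * math.sqrt(Ag1**2 * Ag2**2) / math.sqrt(0.37 * Ag2**2 + c * Ag1**2)

def kinetic_resistance(Ageq, AF1, Ug):
    """Kinetic resistance of the one-duct glottis, Eq. (2.4)."""
    return (RHO / 2.0) * Ug * (
        1.37 / Ageq**2 - (2.0 / (Ageq * AF1)) * (1.0 - Ageq / AF1))

def viscous_resistance(Ageq, lg=1.4, d=0.3):
    """Viscous resistance Rv = 12 mu lg^2 d / Ageq^3."""
    return 12.0 * MU * lg**2 * d / Ageq**3

def inertance(Ageq, d=0.3):
    """Inertance Lg = rho d / Ageq."""
    return RHO * d / Ageq
```

A useful consistency check is that Eq. (2.4) equals the kinetic part of Eq. (2.1) when both glottal areas are set equal to Ageq, which is how the one-duct form arises.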

Figure 2.21 illustrates this second model for the source.
The equivalent glottal area Ageq matches the waveforms
proposed by researchers (Ananthapadmanabha and Fant, 1982;
Allen and Strong, 1985) more closely than the projected
glottal area does (Cranen and Boves, 1987). Figure 2.22 shows
the equivalent and the projected glottal areas for a vertical
phase difference of 55 degrees and a rest area (or leak
opening) of 5 mm².

Figure 2.21 Second model for the glottal source:
equivalent glottal area Ageq.

$$ R_v = 12\,\mu\,l_g^2\,d / A_{geq}^3, \qquad L_g = \rho\,d / A_{geq} $$
$$ R_{kin} = \frac{\rho}{2} U(1) \left[ \frac{1.37}{A_{geq}^2} - \frac{2}{A_{geq} A_{F1}} \left( 1 - \frac{A_{geq}}{A_{F1}} \right) \right] $$
$$ A_{geq} = N_f\,\frac{\{A_{g1}^2 A_{g2}^2\}^{1/2}}{\{0.37 A_{g2}^2 + [1 - 2A_{21}(1 - A_{21})]\,A_{g1}^2\}^{1/2}} $$

where:

d : thickness of the glottal slit
A21 = Ag2 / AF1
AF1 : area of the first section of the vocal tract
Nf : normalization factor

Figure 2.22 Equivalent glottal area Ageq and projected
glottal area Agp.

2.3.1.4 Control Parameters.

The  proposed  models  fulfill  the  desired  requisites 
specified  earlier  in  this  section.  The  features  of  the  two- 
mass  model  are  available,  without  the  usually  heavy 
computational  burden  of  the  model.  Furthermore,  the  control 
parameters  can  be  properly  derived. 

Fundamental frequency. The cord-tension parameter Q is
not used to control the pitch period; instead, the fundamental
frequency contour (intonation pattern) is supplied directly
to the model (T in the glottal area function). Several
techniques have been used to extract the pitch period contour
from the speech signal (Chap. 1, Section 1.4.1.4). We
developed a new scheme to extract the pitch contour very
accurately and to determine the jitter (as an aid for
pathological voice identification). The algorithm uses
the same techniques employed by Deem et al. (1989), namely
peak picking, zero crossing, and interpolation, but it
is completely automatic. Figure 2.23 shows the
pitch contour for a segment of the utterance "We were away a
year ago," using this scheme.

Amplitudes. The amplitudes of the glottal areas A1 and
A2 and also the subglottal pressure are related to the energy
of the speech. Figure 2.24 shows the power contour for the
sentence "We were away a year ago." As an additional control
for the steepness of the slopes in the equivalent glottal
area, A1 and A2 are related by:

A2 = A1 + Sk

where Sk: steepness constant;

A1: typical value for stressed vowels: 0.2 cm2

Figure 2.23 Pitch contour for "We were away a year ago."
a) 25 ms window. b) 120 ms window. c) entire sentence
(smoothed).

Figure 2.24 Power contour and spectral transition rate
contour for the sentence "We were away a year ago."

Open and speed quotients. The open quotient Qo can be
conveniently estimated by peak picking in the DEGG
signal (Fig. 1.26). When the DEGG is not available, an
inverse-filtering technique can be employed. An accurate
estimate of the closed-phase interval is provided as a
byproduct of the weighted recursive least squares algorithm
with variable forgetting factor (WRLS-VFF). The WRLS-VFF
(Ting and Childers, 1990) is primarily used for formant
tracking, and the closed-phase interval is estimated from the
variable forgetting factor error. The speed quotient Qm
determines the asymmetry of the glottal area. For normal
voices the glottal areas are right-skewed, which means that Qm
is greater than 0.5. In addition, MMIRC has collected and
processed modal and pathological voices in order to derive
statistical data about the excitation waveforms. Therefore,
ranges of Qo and Qm values are available for the simulation of
different kinds of voices.

We derived a relationship between the quotients
of the glottal areas Ag1 and Ag2 and the quotients of the
equivalent glottal area. Considering the two glottal areas
Ag1 and Ag2 with the same quotients (Qo1 = Qo2 = Qo and Qm1 =
Qm2 = Qm), the following relationships are used in the
simulation of the equivalent glottal area Ageq:

$$Q_o = Q_{oeq} + \theta/360 \tag{2.6}$$

$$Q_m = (Q_{oeq} Q_{meq} + \theta/720) / Q_o \tag{2.7}$$

where:

Qoeq and Qmeq: open and speed quotients of Ageq.

θ: phase difference between Ag1 and Ag2 (in degrees).
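Equations (2.6) and (2.7) can be inverted to obtain the quotients of the equivalent glottal area from the quotients of the two glottal areas. The sketch below shows both directions; the function names are hypothetical and the phase difference is assumed to be given in degrees.

```python
def source_quotients(qo_eq, qm_eq, theta_deg):
    """Equations (2.6) and (2.7): (Qoeq, Qmeq, phase) -> (Qo, Qm)."""
    qo = qo_eq + theta_deg / 360.0
    qm = (qo_eq * qm_eq + theta_deg / 720.0) / qo
    return qo, qm

def equivalent_quotients(qo, qm, theta_deg):
    """Algebraic inverse of (2.6) and (2.7): (Qo, Qm, phase) -> (Qoeq, Qmeq)."""
    qo_eq = qo - theta_deg / 360.0
    if qo_eq <= 0.0:
        raise ValueError("phase difference too large for this open quotient")
    qm_eq = (qm * qo - theta_deg / 720.0) / qo_eq
    return qo_eq, qm_eq

# Hypothetical setting: Qo = 0.65, Qm = 0.56, phase difference of 55 degrees.
qo_eq, qm_eq = equivalent_quotients(0.65, 0.56, 55.0)
```

The guard on Qoeq reflects that (2.6) only yields a positive equivalent open quotient when the phase difference, expressed as a fraction of the period, is smaller than Qo.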

Subglottal pressure Ps. The subglottal pressure Ps, for
vowels, is proportional to the square root of the energy of
the speech. For consonants, to account for the pressure drop
at the constriction, the value of Ps is greater than that of
vowels (Isshiki, 1964). The spectral distance, used to adapt
the window size in the formant synthesizers (Pinto et al.,
1989), can also assist the estimation of the subglottal
pressure. Figure 2.24 shows the spectral transition rate
contour for the sentence "We were away a year ago."

Discussion. The pitch period T, the rest area Agmin, the
amplitude A and the open and speed quotients Qo and Qm are
used by the formant, LPC and articulatory synthesizers. Given
the values of the quotients, those for the equivalent glottal
area model can be determined by using equations (2.6) and
(2.7). Our models can provide, however, additional parameters
for the simulation of pathological and female voices: the
subglottal pressure Ps, the amplitudes A1 and A2 of the
glottal area functions, the phase difference θ between the
glottal areas, the effective length of the glottal slit lg,
the thickness of the glottal slit d, and the thicknesses d1 and
d2 of the masses m1 and m2, respectively (first model only).

Figures  2.25  to  2.29  show  the  effects  of  some  glottal 
parameter  variation  on  the  equivalent  glottal  area  Ageq. 

2.3.2  Vocal  and  Nasal  Tract  Models. 

In  Sections  1.3.3  and  1.4.2,  we  discussed  the  vocal  and 
nasal  tract  models.  Figure  1.7  shows  the  block  diagram  for 
the  articulatory  synthesizer.  Table  1.3  summarizes  the 
acoustic  tube  equations.  Figure  1.9  shows  the  equivalent 
lumped-acoustic  transmission  line  for  an  elemental  lossy 
vocal-tract  tube  and  Fig.  1.27  illustrates  the 
interconnection  of  the  vocal  and  nasal  tracts.  Figure  1.35 
shows  the  network  representation  for  the  whole  synthesizer. 
Section 1.4.2.3 deals with the noise source models for
fricatives, stops and aspirated sounds and Section 1.4.2.4
discusses the main approaches to implement the concatenation
of tubes.

Figure 2.25 Effect of the open quotient Qo on the
equivalent glottal area Ageq.

Figure 2.26 Effect of the speed quotient Qm on the
equivalent glottal area Ageq.

Figure 2.27 Effect of the rest area Agmin on the
equivalent glottal area Ageq.

Figure 2.28 Effect of the phase difference θ on the
equivalent glottal area Ageq.

Figure 2.29 Effect of the amplitude difference Sk on the
equivalent glottal area Ageq.

A time-domain approach inspired by Maeda's work (1982)
has been chosen for the tracts. The continuous acoustic tube
equations are digitized by the "midpoint rule." Since the
acoustical equations are stiff (Childers and Ding, 1991), an
implicit method, the trapezoidal algorithm, has been adopted
to solve the differential equations with numerical
stability regardless of the step size. The "trapezoidal
rule" and the "central difference with averaging" provide
the equivalent resistive discrete circuit for the capacitors
and inductors in the network, as illustrated in Fig. 2.30. In
each T-section, the equivalent resistors and sources are
combined. The resulting resistive circuit is then simply
dc-analyzed with Kirchhoff's laws.
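The idea of replacing a reactive element by a resistor and a source can be sketched on a single capacitor. With the trapezoidal rule, a capacitor C stepped with period T behaves like a conductance 2C/T in parallel with a "history" current source built from the previous sample. The example below applies this to a simple RC discharge, which is not a circuit from the synthesizer; the function name and the test values are assumptions chosen only to make the technique concrete.

```python
import math

def simulate_rc_trapezoidal(r, c, v0, dt, n_steps):
    """Discharge of a capacitor C through a resistor R, solved with the
    trapezoidal companion model (equivalent conductance plus history source)."""
    g = 2.0 * c / dt            # equivalent conductance of the capacitor
    v = v0                      # capacitor voltage at the previous step
    i = -v0 / r                 # capacitor current at the previous step
    voltages = [v]
    for _ in range(n_steps):
        i_hist = g * v + i              # history current source
        v = i_hist / (g + 1.0 / r)      # nodal equation of the resistive circuit
        i = g * v - i_hist              # updated capacitor current
        voltages.append(v)
    return voltages

# With a step of RC/100, the result after one time constant should be
# close to the analytical value v0 * exp(-1).
r, c, v0 = 1000.0, 1e-6, 1.0
v_small_step = simulate_rc_trapezoidal(r, c, v0, dt=r * c / 100.0, n_steps=100)

# With a step ten times larger than RC the solution stays bounded,
# which illustrates the stability property mentioned in the text.
v_large_step = simulate_rc_trapezoidal(r, c, v0, dt=10.0 * r * c, n_steps=50)
```

At each step only a resistive network has to be solved, which is the property the synthesizer exploits for every T-section.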

The articulatory model, using a database, provides the
60-section vocal tract area function (equal length), the
position and area of the constriction, and the position of the
coupling between the nasal and vocal tract, for each target in
the segment of speech.

The nasal tract is represented by 11 sections of equal
length (1 cm). The data are obtained from Fant (1960). The
first three sections can vary due to the velum movement while
the others are fixed. The areas of sections 2 and 3 are
interpolated between the areas of sections 1 and
4. The area of the first section (velopharyngeal port)
controls the degree of coupling between the tracts. The range
of values is 0 to 1 cm2, and the perception of nasality begins
at a port area of 50 mm2 (Borden and Harris, 1984). The
maxillary sinuses can be optionally connected to the seventh
section of the nasal circuit. Data for the sinuses are
available in Maeda's work (1982a).

2.3.3  Radiation  and  Noise  Source 

The basic points about the radiation and the noise source
models were presented in Sections 1.4.2.2 and 1.4.2.3. The
simplified radiation model (high-pass filter, 6 dB/octave)
due to Flanagan (1972a) has been used (Fig. 1.30). The
radiation through the walls of the throat is neglected.

Figure 2.30 Equivalent resistive discrete circuit for
linear capacitors and inductors (panel a: trapezoidal
algorithm).

To  simulate  unvoiced  sounds,  we  followed  Sondhi  and 
Schroeter's  approach  (1987).  The  noise  sources  are 
controlled  by  the  Reynolds  number  Re,  which  determines  the 
transition  from  laminar  to  turbulent  flow.  The  squared 
Reynolds  number  is: 

$$Re^2 = \frac{4 \rho^2 U_c^2}{\pi \mu^2 A_c}$$

where:

Uc: volume-velocity at the constriction
Ac: area of the constriction
ρ: density of air
μ: viscosity of air

If the local Reynolds number Re exceeds the critical
Reynolds number Rec (empirically estimated), then there is
turbulence, which can be modelled by a random noise source.

For aspirated sounds, a series noise pressure pa is
inserted before the first vocal tract section:

$$p_a = \begin{cases} g_a \,\mathrm{rand}\,(Re^2 - Re_c^2) & \text{if } Re > Re_c \\ 0 & \text{otherwise} \end{cases}$$

where:

ga: gain (2 x 10^-6)

rand: random number uniformly distributed
between -0.5 and 0.5

Rec: about 2700

For fricatives and plosives a current source un (parallel
configuration, Fig. 1.31) is inserted "one section downstream
of the outlet of the narrowest constriction" (Sondhi and
Schroeter, 1987, p. 962):

$$u_n = \begin{cases} g_n \,\mathrm{rand}\,(Re^2 - Re_c^2) / R_n & \text{if } Re > Re_c \\ 0 & \text{otherwise} \end{cases}$$

where:

gn: gain (10^-6)

Rec: 2000

Rn: source resistance = ρ Uc / (2 Ac^2)
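The two noise sources can be sketched as follows. This is only an illustration of the threshold rule on the squared Reynolds number; the function names, the CGS values for the density and viscosity of air, and the use of the absolute value of the flow in the source resistance are assumptions of this example.

```python
import math
import random

# Assumed physical constants in CGS units (illustrative values only).
RHO = 1.14e-3   # density of air, g/cm^3
MU = 1.86e-4    # viscosity of air, dyne*s/cm^2

def reynolds_squared(uc, ac):
    """Squared Reynolds number at a constriction of area ac with flow uc."""
    return 4.0 * RHO ** 2 * uc ** 2 / (math.pi * MU ** 2 * ac)

def aspiration_pressure(uc, ac, gain=2e-6, re_crit=2700.0, rng=None):
    """Series noise pressure for aspirated sounds; zero below the threshold."""
    rng = rng or random
    excess = reynolds_squared(uc, ac) - re_crit ** 2
    if excess <= 0.0:
        return 0.0
    return gain * (rng.random() - 0.5) * excess

def frication_flow(uc, ac, gain=1e-6, re_crit=2000.0, rng=None):
    """Parallel noise flow source for fricatives and plosives."""
    rng = rng or random
    excess = reynolds_squared(uc, ac) - re_crit ** 2
    if excess <= 0.0:
        return 0.0
    r_n = RHO * abs(uc) / (2.0 * ac ** 2)   # source resistance
    return gain * (rng.random() - 0.5) * excess / r_n

# Hypothetical constriction: a slow flow through a wide passage stays laminar,
# a fast flow through a narrow passage becomes turbulent.
quiet = aspiration_pressure(uc=10.0, ac=0.5)
noisy = frication_flow(uc=500.0, ac=0.1, rng=random.Random(1))
```

Passing a seeded `random.Random` instance makes a run reproducible, which can help when comparing synthesis settings.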

The energy of the turbulent noise, distributed over a
range between 2 and 8 kHz, is more pronounced in the 4 kHz
region (Isshiki et al., 1978).

2.4 Analysis/Synthesis Scheme

The generation of synthetic speech on an
analysis/synthesis-based articulatory synthesizer comprises
two different phases: an off-line phase aimed at deriving the
input data for the synthesizer and an on-line phase for the
"resynthesis."

Figure 2.31 Articulatory synthesizer: analysis phase.

During the off-line phase (analysis) two files are
established: the source file and the vocal tract file. Figure
2.31 is a flowchart for the analysis phase. The
classification of the speech into V/U/M/S and the location of
the phoneme boundaries assist the selection of targets in the
speech segment. The classification is based on short-time
energy (Fig. 2.32), short-time zero-crossing rate (Fig. 2.33),
spectral analysis and differentiated EGG (Pinto et al., 1989;
Childers et al., 1989). In our scheme the non-smoothed pitch
contour is also used as an input to the classifier (Fig.
2.34). The EGG is used for the classification and for
deriving the open quotient Qo (Section 2.3.1.4). When it is not
available, an inverse filtering technique, such as the
WRLS-VFF, can be employed (Section 2.3.1.4). Figure 2.35 shows the
final classification for the sentence "We were away a year
ago."
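Two of the features named above, short-time energy and short-time zero-crossing rate, can be computed with a few lines of code. The sketch below uses plain Python lists and hypothetical function names; it shows only the feature extraction, not the classifier itself, whose decision rules are not given in this section.

```python
import math

def split_frames(signal, frame_len, hop):
    """Cut a sample sequence into overlapping frames of equal length."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Example: a 100 Hz sinusoid sampled at 10 kHz, analysed in 25 ms frames
# with a 12.5 ms hop (all values chosen for illustration).
fs = 10000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(1000)]
features = [(short_time_energy(f), zero_crossing_rate(f))
            for f in split_frames(tone, frame_len=250, hop=125)]
```

Voiced segments typically show high energy and a low zero-crossing rate, while unvoiced fricatives show the opposite pattern, which is why the two contours are useful together.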

Figure 2.36 sketches the setup of the source and vocal
tract files. Only two targets are shown for each file. Each
record of the source file contains one target, comprising tt,
the time epoch of the target; F0, the pitch period; Amin, the
minimum glottal area; A, the maximum glottal area; Qo, the
open quotient; Qm, the speed quotient; and Ps, the subglottal
pressure. Each record of the vocal tract file includes TT,
the time epoch of the target; NAS, the section number of the
velum; OPNAS, the opening of the velum; RLEN, the length of a
section; and AF, the area function of the target.

The synthesis phase is shown in Fig. 2.37. The first
time the routines are called (TIME=TFIRST), the following
values are established:

(1) Initial rest conditions corresponding to the
capacitors and inductors in the network (see Fig. 2.30).

(2) Physical constants: velocity of sound, viscosity of
air, density of air, etc.

(3) Options for the synthesis: number of sections,
sampling frequency, etc.

(4) Controls for the program and constant data.

Figure 2.32 Short-time energy contour (utterance: "We were
away a year ago").

Figure 2.33 Short-time zero-crossing rate (utterance: "We
were away a year ago").

Figure 2.34 Raw pitch contour as an input to V/U/M/S
classifier. Note: The smoothed pitch contour is given in
Fig. 2.23.

Figure 2.35 Classification of the sentence "We were away a
year ago."

Figure 2.36 Illustration for the setup of the input files
to the articulatory synthesizer: source and
vocal tract files.

Figure 2.37 Articulatory synthesizer: synthesis phase.

Then, the values in between the targets are obtained by
using linear interpolation (MacNeilage, 1970; Gay, 1977).
Using the source file, the subglottal pressure Ps(TIME) is
interpolated at each time. The glottal area waveforms
Ag1(TIME) and Ag2(TIME) (parametric 2-mass model) or Ageq(TIME)
(equivalent glottal area) are generated using cycle-by-cycle
interpolation. Using the vocal tract file, the area functions
and section lengths are interpolated at each time, with the
number of sections converted optionally to 30, 20, 15, 12, or
10 sections. Scaling of areas and lengths can also be
performed, to provide flexibility in the modelling of female
voices.
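The interpolation between targets and the reduction of the number of sections can be sketched as below. The function names are hypothetical, and merging adjacent sections by averaging their areas is an assumption of this example; the text does not state how the conversion from 60 sections is performed.

```python
from bisect import bisect_right

def interp_scalar(times, values, t):
    """Linear interpolation of a target parameter (e.g. Ps) at time t.
    Values are held constant before the first and after the last target."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_right(times, t) - 1
    w = (t - times[k]) / (times[k + 1] - times[k])
    return (1.0 - w) * values[k] + w * values[k + 1]

def interp_area_function(times, area_functions, t):
    """Section-by-section linear interpolation of the area function."""
    n_sections = len(area_functions[0])
    return [interp_scalar(times, [af[j] for af in area_functions], t)
            for j in range(n_sections)]

def reduce_sections(areas, section_len, n_out):
    """Merge equal-length sections into n_out longer ones (assumed rule:
    average the areas of each group, which preserves the tube volume)."""
    if len(areas) % n_out != 0:
        raise ValueError("n_out must divide the number of sections")
    group = len(areas) // n_out
    merged = [sum(areas[j * group:(j + 1) * group]) / group
              for j in range(n_out)]
    return merged, section_len * group

# Hypothetical targets: two area functions 58 ms apart, 60 sections each.
times = [0.0, 0.058]
targets = [[1.0] * 60, [3.0] * 60]
areas_now = interp_area_function(times, targets, 0.029)
areas_30, len_30 = reduce_sections(areas_now, section_len=0.29, n_out=30)
```

All the section counts offered by the synthesizer (30, 20, 15, 12 and 10) divide 60, so a grouping rule of this kind applies to each of them.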

The resistive equivalent network is set up (trapezoidal
algorithm). Modifications to the glottal length and
thickness, velopharyngeal port area, sinuses, critical
Reynolds number, and the yielding-wall physical values
(resistance, stiffness and mass per unit area) can be made via
parameter files, to allow easier experimentation. Noise
sources are inserted for the case of fricatives, stops and
aspirated sounds. The equations for the pressures at the
nodes and volume-velocities at the branches are solved using
backward substitution. The finite-difference derivative of
the sum of the volume-velocities at the nostrils and lips
represents the synthetic speech at time=TIME. The procedure
is repeated, with updated initial conditions, until the final
time epoch is reached. To be played, the signal is finally
processed by a digital-to-analog converter and by an
anti-aliasing filter.

2.5 Results and Discussion

Our time-domain realization for a target-based
articulatory synthesizer is an efficient tool for the
investigation of speech synthesis.

Suitable initial configurations for the optimization
process to be described in Chapter 3 can be derived from the
interactive graphic-editor articulatory model, which draws the
vocal-tract outline on an articulator-by-articulator basis,
using "button" macro commands, cross-hair coordinates or
keyboard entries.

The equivalent glottal area model meets the requirements
of low computational burden and availability of several
control parameters for the experiments. The equation relating
the quotients of Ananthapadmanabha and Fant's model (1982) and
the equivalent glottal area quotients allows the simulation of
the desired pathological voice features. The derivation of
the parameters and the choice of targets are well supported by
our previous work on analysis of speech. Furthermore, our
algorithm for pitch extraction proved to be robust.

The  source-tract  interaction  is  considered  and  the  nasal 
tract  and  sinuses  can  be  easily  incorporated  into  the  model. 

The  use  of  the  equivalent  resistive  circuit  (trapezoidal 
algorithm)  provides  a great  improvement  in  computational  speed 
over  some  other  methods,  but  further  enhancement  can  be 
pursued.  The  ease  with  which  the  parameters  can  be  modified 
must  also  be  pointed  out. 

Gupta and Schroeter (1991) classified the schemes of
adaptation as "single-frame optimization" if the parameters
are adapted once every pitch period, and as "multi-frame
optimization" if the parameters are adapted "every few pitch
periods." Using this classification scheme, our method is a
multi-frame optimization technique. However, we prefer to
call it "target-based optimization." The word "frame" is
associated with the analysis phase. Our window lengths
(targets) are determined by the results of the classification
and labeling of the utterance to be synthesized. The more
targets, the higher the fidelity of the synthetic speech, but
also the higher the computational time. We have used an
average of one area function target every 58 ms and one
source tract target every 48 ms. The number of vocal tract
targets depends on the number of phonemes and on their
durations. A non-linear interpolation technique is proposed
for future investigation in order to improve the
transitions between targets (Chap. 5, suggestion 4). The
quality of the synthetic speech can be greatly improved by
smoothing the glottal parameters, mainly the subglottal
pressure and the fundamental frequency. Again the tradeoff is
quality versus computational time.

The frequency warping caused by the mapping from the
continuous system to the discrete system is a function of
frequency and depends on the sampling period T and on the
sampling interval in space X. An error analysis was conducted
by Flanagan (1972a) and Maeda (1982). Our realization
provides the options of 10, 12, 15, 20, 30 or 60 vocal-tract
sections and of 10, 20, 30, or 40 kHz for the sampling
frequency. A choice of 30 sections (element length of about
0.5 cm) and 30 kHz assures a small warping magnitude in the
range of 0 to 4 kHz, with a tolerable increase in the
computational time.
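The dependence of the warping on the sampling period can be illustrated with the standard relation for the trapezoidal (bilinear) rule, under which a continuous-time frequency f appears in the discrete system at (fs/π) arctan(π f / fs). The sketch below covers only this temporal component and ignores the spatial sampling interval X, so it is not the full error analysis cited above; the function name is hypothetical.

```python
import math

def warped_frequency(f_hz, fs_hz):
    """Frequency at which a continuous-time resonance at f_hz appears
    after trapezoidal-rule discretization at sampling rate fs_hz.
    Temporal warping only; spatial discretization is not modelled."""
    return (fs_hz / math.pi) * math.atan(math.pi * f_hz / fs_hz)

# Compare the sampling-frequency options listed in the text at 4 kHz.
for fs in (10000.0, 20000.0, 30000.0, 40000.0):
    shifted = warped_frequency(4000.0, fs)
    print(f"fs = {fs / 1000:.0f} kHz: 4000 Hz maps to {shifted:.0f} Hz")
```

The shift shrinks as the sampling frequency grows, which is consistent with the tradeoff between warping and computational time described above.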

Figures 2.38 to 2.42 show the synthetic sentences "We
were away a year ago," "Good bye, Bob," and "Ben went
mining," obtained by using the parametric 2-mass model, for
various values of the glottal parameters. Figures 2.43 to
2.47 show the same sentences, obtained by using the equivalent
glottal area, also for various values of the glottal
parameters. Both models produce quite good results for
proper values of the parameters. However, an optimization
scheme for the glottal excitation parameters can provide the
best quality for the synthetic sentences (Chap. 5, suggestion
6). The quality of the nasal sounds needs to be improved by
adopting a special scheme for the inverse mapping of nasals
(Chap. 5, suggestion 3).

The processing time on a SPARCstation 1+ is about 1 min
for each second of speech, using 30 sections, a sampling
frequency of 40 kHz, and one additional output channel for
writing parameters.

Figure 2.38 Synthetic sentence "We were away a year ago,"
using the parametric 2-mass model, θ = 10,
Qo = 0.96, Qm = 0.5, F0av = 100 Hz, Sk = 0.01
and Agmin = 0.025 cm2.

Figure 2.39 Synthetic sentence "We were away a year ago,"
using the parametric 2-mass model, θ = 55,
Qo = 0.65, Qm = 0.56, F0av = 100 Hz, Sk = 0.00,
Agmin = 0.0 cm2, and a smoothed subglottal
pressure contour.

Figure 2.40 Synthetic sentence "Ben went mining," using the
parametric 2-mass model, θ = 45, Qo = 0.7,
Qm = 0.6, F0av = 90 Hz, Sk = 0.0,
Agmin = 0.0 cm2, and no sinuses.

Figure 2.41 Synthetic sentence "Good bye, Bob," using the
parametric 2-mass model, θ = 55, Qo = 0.753,
Qm = 0.6, F0av = 100 Hz, Sk = 0.01,
and Agmin = 0.0 cm2.

Figure 2.42 Synthetic sentence "Good bye, Bob," using
the parametric 2-mass model, θ = 10,
Qo = 0.6, Qm = 0.6, F0av = 100 Hz,
Sk = 0.0, Agmin = 0.0 cm2.

Figure 2.43 Synthetic sentence "We were away a year ago,"
using the equivalent glottal area model, θ = 5,
Qo = 0.65, Qm = 0.56, F0av = 100 Hz, Sk = 0.0 and
Agmin = 0.0 cm2.

Figure 2.44 Synthetic sentence "We were away a year ago,"
using the equivalent glottal area model, θ = 15,
Qo = 0.96, Qm = 0.5, F0av = 100 Hz, Sk = 0.01 and
Agmin = 0.025 cm2.

Figure 2.45 Synthetic sentence "Good bye, Bob," using the
equivalent glottal area model, θ = 0.0, Qo = 0.8,
Qm = 0.6, F0av = 100 Hz, Sk = 0.0, Agmin = 0.01
cm2, and a smooth subglottal pressure contour.

Figure 2.47 Synthetic sentence "Ben went mining," using the
equivalent glottal area model, θ = 45, Qo = 0.7,
Qm = 0.6, F0av = 90 Hz, Sk = 0.0, Agmin = 0.0 cm2
and the sinuses.

CHAPTER 3

THE INVERSE MAPPING
3.1 Introduction

The  derivation  of  the  vocal  tract  area  function  from 
speech  is  an  inverse  problem.  The  values  of  the  articulatory 
model  are  inferred  from  the  measured  acoustic  parameters.  We 
attempt  a solution  to  this  problem  with  a nonlinear 
multidimensional  optimization.  The  combination  of  two 
techniques  proved  to  be  suitable  for  implementing  this 
approach:  successive  approximation  and  gradient  search.  The 
results  from  the  analysis  phase  of  both  formant  and  LPC 
synthesizers  (Fig.  1.36)  provided  the  desired  acoustical 
parameters  (formants)  for  the  generation  of  the  objective 
function.  In  a text-to-speech  context,  look-up  tables  can  be 
used.  The  parameter  vector  to  be  optimized  consisted  of  the 
coordinates  for  the  jaw,  tongue  body,  tongue  tip,  hyoid,  velum 
and lips. Once the optimum articulatory vector was
determined, the articulatory model provided the vocal-tract
area function to the articulatory synthesizer.

TABLE 3 DERIVATION OF VOCAL TRACT AREA FUNCTION

I. DIRECT METHODS

1. X-RAY PHOTOGRAPHY OR HIGH-SPEED CINERADIOGRAPHY

- Radiation; difficult identification of articulators.

2. X-RAY MICROBEAM

- Pellets attached to the articulators.

- Less radiation; laborious data processing.

3. COMPUTERIZED TOMOGRAPHY AND MAGNETIC RESONANCE IMAGING

II. INDIRECT METHODS

1. ACOUSTICAL PULSE REFLECTION

- Measurement and analysis of incident and reflected wave.

2. LINEAR PREDICTION/ACOUSTIC TUBE MODEL (LPAT)

- LPC PARCOR coefficients vs reflection
coefficients of lossless tube.

- Only for voiced and non-nasal phonemes.

- Main problems: nonuniqueness of the mapping, uncertainty
of the source, limitations of the model.

3. LIP IMPULSE RESPONSE (LIR)

- Area derived from the poles and zeros of the terminal
impedance of the vocal tract.

- Sondhi's Method: impedance tube at the lips; recovery of
dynamic variations in real time; "silent" articulation.

- Milenkovic and Muller: throat accelerometer, field
microphone and spectrum analyzer.

4. NUMERICAL APPROACHES

- Generation of codebooks.

- Computer sorting and table look-up.

5. FEEDBACK SYSTEMS

- Minimization of distances between features of the
original and synthetic speech.

6. NEURAL NETWORK

- Supervised learning model.

3.2 Main Issues in the Area Function Derivation

Section 1.4.2.5 presented a brief overview of the methods
used to obtain the vocal tract area function. Here we discuss
in more detail the problems, limitations, advantages and
drawbacks of the several techniques. Table 3 summarizes the
methods that have been used by the researchers.

3.2.1  The  Direct  Methods 

The results obtained in the past by the two direct
methods, X-ray photography and X-ray microbeam, are still the
source of data for many synthesizers (Fant, 1960; Perkell,
1969). However, these methods are no longer suitable, by
virtue of the danger that the exposure to radiation
represents. Besides, even with the use of small pellets
attached to the articulators, the identification of shapes and
the data processing are difficult and laborious.

Computer tomography (Johansson et al., 1983) and magnetic
resonance imaging (MRI) have proved to be more convenient and
reliable techniques to obtain direct measurements of the vocal
tract.

3.2.2  The  Theoretical  Basis  for  the  Inverse  Mapping 

The analytical solution for the inverse mapping problem
has been developed according to three rationales or lines of
reasoning:

(1) The Portnoff equations (Table 1.3, equations 1)
relate pressure and volume-velocity as functions of the
position along the tube, x, and of the time, t. The area
function A(x,t) is a parameter. If boundary conditions are
known at the ends of the tube (the glottis and the lips), a
numerical solution can be found. The bandwidths of the signal
parameters that govern the motion of the articulators are small
when compared with the bandwidth of the acoustic signal.
Consequently, the tract can be considered stationary for small
time intervals, for the purpose of deriving the area along the
tube. It follows that, if a functional relationship can be
established between the pressure and the volume-velocity, then
the eigenvalues of this function may provide the area
information.

(2)  For  voiced  phonemes,  the  speech  signal  is  the  result 
of  the  convolution  of  the  glottal  excitation  with  the  transfer 
function  of  the  vocal  tract.  If  the  excitation  can  be 
estimated,  then  the  area  function  may  be  derived  from  the 
transfer  function. 

(3) If the acoustic features of the original speech are
known, then an iterative process that minimizes the distance
between these features and those generated by the articulatory
model may lead to an appropriate set of articulatory
parameters.
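The third rationale can be illustrated with a deliberately small example: a uniform lossless tube, closed at the glottis and open at the lips, whose only free parameter is its length. Its resonances are the odd quarter-wavelength frequencies, and the length is recovered by scanning for the value that minimizes a formant distance. The single-parameter model, the relative absolute-error distance, the assumed speed of sound and all function names are simplifications chosen for this sketch; the actual articulatory model has many parameters and its own objective function.

```python
C_SOUND = 35000.0   # assumed speed of sound, cm/s

def tube_formants(length_cm, n=3):
    """First n resonances of a uniform tube closed at one end."""
    return [(2 * k - 1) * C_SOUND / (4.0 * length_cm) for k in range(1, n + 1)]

def formant_distance(model, target):
    """Sum of relative absolute errors between two formant sets."""
    return sum(abs(m - t) / t for m, t in zip(model, target))

def fit_length(target, lo=10.0, hi=25.0, step=0.01):
    """Exhaustive scan over tube lengths; returns the best-matching length."""
    n_points = int(round((hi - lo) / step)) + 1
    candidates = (lo + i * step for i in range(n_points))
    return min(candidates,
               key=lambda length: formant_distance(tube_formants(length), target))

# "Measured" formants of a neutral vowel (500, 1500, 2500 Hz).
best_length = fit_length([500.0, 1500.0, 2500.0])
```

A real feedback system would replace the scan by a search over the articulatory vector and add constraints that exclude physiologically impossible shapes.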

A breakthrough was achieved by Mermelstein and Schroeder
(1965) using an iterative first-order perturbation analysis of
the Webster horn equation. Their theoretical results comply
with our first rationale for solving the inverse mapping
problem, i.e., two infinite sets of eigenvalues of the lip
impedance (zeros and poles, for example), under different
boundary conditions, uniquely specify the area function.
This approach was adopted and improved upon by Sondhi (1974,
1979) and co-workers Gopinath (Sondhi and Gopinath, 1971) and
Resnick (Sondhi and Resnick, 1983). It is called the "Lip
Impulse Response" Method.

The "Linear Prediction/Acoustic Tube Model (LPAT)"
method, adopted by Wakita (1973, 1979), complies with our
second rationale. The reflection coefficients ri in
concatenated lossless tubes relate the cross-sectional areas
Ai and Ai+1 of two adjacent tubes. The coefficients ri and the
LPC partial correlation (PARCOR) coefficients ki are related
by the equation ri = -ki (Rabiner and Schafer, 1978, Section
8.7). Therefore, if the PARCOR coefficients can be estimated
without being affected by the glottal source or by the
radiation (using pre-emphasis or closed-phase analysis, for
example), then an acceptable area function can be derived.
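The step from PARCOR coefficients to areas can be sketched as follows. The sketch assumes the common lossless-tube convention ri = (Ai+1 - Ai)/(Ai+1 + Ai), so that Ai+1 = Ai (1 + ri)/(1 - ri); the indexing direction, the normalization of the first area and the function name are assumptions of this example.

```python
def areas_from_parcor(parcor, first_area=1.0):
    """Relative area function of a concatenated lossless tube model.

    parcor: PARCOR coefficients k_i, related to the reflection
            coefficients by r_i = -k_i.
    Assumed convention: r_i = (A_{i+1} - A_i) / (A_{i+1} + A_i).
    Only area ratios are recovered; first_area fixes the scale.
    """
    areas = [first_area]
    for k in parcor:
        r = -k
        if abs(r) >= 1.0:
            raise ValueError("reflection coefficients must satisfy |r| < 1")
        areas.append(areas[-1] * (1.0 + r) / (1.0 - r))
    return areas

# Zero coefficients describe a uniform tube; k = -1/3 doubles the area.
uniform = areas_from_parcor([0.0, 0.0, 0.0])
stepped = areas_from_parcor([-1.0 / 3.0])
```

Since only ratios of adjacent areas are determined, the absolute scale of the area function has to come from an additional assumption or measurement.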

Finally, the third rationale is the basis for the
"Feedback or Analysis/Synthesis Method." The main concern in
this method is choosing an appropriate distortion measure
(between features of the real and synthetic speech) and
selecting an efficient algorithm for the minimization of
this distortion. Physiologically impossible vocal tract
configurations must be excluded from the solutions by
imposing constraints.

3.2.3 Advantages and Disadvantages

3. 2. 3.1  Lip  impulse  response  method 

Two  versions  of  this  method  have  been  realized  in  the 
frequency  domain.  The  original  approach,  by  Schroeder  and 
Mermelstein (1965), required an infinite number of poles and
zeros, but only a finite number of eigenvalues could be
measured in the practical realization. Extensive experimental
results proved, however, that the set of "n lowest-order poles
and zeros ... uniquely determined the 2n lowest-order Fourier
components of the logarithmic area function" (Mermelstein,
1967, p. 324). By modifying the Fourier coefficients
iteratively,  using  perturbation  theory,  an  approximate  area 
function  (band-limited)  was  derived  as  an  extension  of  the 
uniform  tube,  which  was  used  as  the  initial  configuration. 
Gopinath  and  Sondhi  (1970)  improved  upon  this  approach  using 
the  poles  and  residues  of  the  input  impedance.  By  using 
asymptotic  values  for  the  higher  order  poles  and  residues, 
discontinuities  in  the  area  function  could  be  handled. 
Despite  these  developments,  Sondhi  reported  that  these 
frequency-domain  approaches  presented  two  major  limitations: 

(1)  The  vocal  tract  length  and  the  boundary  condition  at 
the  glottis  must  be  known  a priori,  in  order  for  a good 
approximation  to  the  real  area  function  to  be  derived. 

(2) For estimating the eigenvalues, a relatively long
time interval had to be used (10-20 ms), impairing the
assumption of stationarity.

To overcome these problems, Sondhi and co-workers (Sondhi
and  Gopinath,  1971;  Sondhi  and  Resnick,  1983)  adopted  time- 
domain  measurements.  The  apparatus  consisted  of  an  impedance 
tube  with  an  acoustic  coupler,  gripped  by  the  subject's  lips, 
a transducer,  a microphone,  an  anti-aliasing  filter  and  an 
analog/digital  converter,  under  the  control  of  a computer. 

Very  good  results  were  obtained  with  this  approach.  Its 
advantages  were: 

(1)  The  vocal  tract  length  and  the  glottal  boundary 
condition  were  no  longer  initial  specifications; 

(2) The measurements and processing were fast enough
(about 1 ms) to recover the time-varying shape of the vocal
tract (real-time dynamic variation). The measurement,
computation and display of the area function could be
accomplished about 18 times per second;

(3) The reconstructed area function can provide
"intelligible though not high quality speech" (Sondhi and
Resnick, 1983, p. 985).

Wakita  (1979)  pointed  out,  however,  that  the  LIR  method 
presents  a major  drawback:  the  attachment  of  the  acoustic  tube 
to  the  mouth  (kept  permanently  closed)  constrains  the 
movements of the lips and jaw, making it difficult to control the
lip opening and protrusion. This "silent" articulation of the
sounds could impair the data analysis and thereby affect the
quality of the synthesized speech.

3.2.3.2 Linear prediction/acoustical tube model

Two  initial  problems  of  this  method  were  the  uncertainty 
of  the  excitation  source  and  the  nonuniqueness  of  the  area 
function derived from the vocal tract transfer function. The
first problem was partially solved by preemphasizing the
speech signal to remove the effects of the source and of the
radiation (Wakita, 1973). An improved solution came about,
however, with the reconstruction of the volume velocity by
inverse filtering the speech during the glottal closed phase
(Wakita, 1979). The second problem, ambiguity, arises from
the fact that the transfer function of a lossless tube
provides only one set of eigenvalues, the poles. The derived
area may be physiologically impossible. A set of specific
distributions of losses in the vocal tract can guarantee a
solution, but it is not well known whether realistic
distributions of losses belong to this set (Atal et al.,
1978). Other options are to impose constraints on the
position of the tongue and other articulators or to collect
sufficient data to resolve the ambiguities (Milenkovik, 1984).

Other problems are related to the limitations of the LPC
model. First, the model excludes nasals and unvoiced sounds.
The assumed boundary conditions (no radiation load and
termination with a uniform and infinite tube at the glottis)
are unrealistic. The losses (heat conduction, viscous
friction and yielding walls) are not considered in the model.
Wakita (1979) suggested the use of conversion charts for the
formant frequencies and bandwidths to minimize the
discrepancies in the final results.

Another  disadvantage  is  having  to  accurately  estimate  the 
vocal  tract  length,  because  the  area  function  is  sometimes 
very  sensitive  to  length  variations.  On  the  other  hand,  even 
with these limitations, the method is considered (Wakita,
1979; Wakita and Gray, 1975) capable of providing fair
results,  without  the  need  of  special  equipment. 

3.2.3.3 Codebook generation: numerical approaches

Instead of dealing with inverse mapping analytically,
Atal et al. (1978) adopted a numerical approach: a
computer-sorting technique. Their numerical approach set up a
codebook by exhaustive computation. A large number of
articulatory vectors x and their associated acoustical vectors
y were calculated. The resulting pairs (x,y) were sorted
according to y. Given a y, the corresponding value of x is
obtained by looking up y in the sorted data. The components
of the articulatory vector x were the parameters of the
articulatory model shown in Fig. 1.12, which is an extension
of the model by Stevens and House (1955). The acoustical
vector was formed by the first three formants.
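
The codebook idea can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: `toy_forward` is a hypothetical
two-parameter stand-in for the articulatory-to-acoustic map (it is not
the model of Fig. 1.12), and the lookup uses a brute-force nearest
neighbor scan rather than the sorting scheme of Atal et al. (1978).

```python
import numpy as np

def toy_forward(x):
    """Hypothetical stand-in for the articulatory-to-acoustic map f."""
    x = np.asarray(x, dtype=float)
    return np.array([500.0 + 300.0 * x[0],
                     1500.0 + 600.0 * x[1] - 200.0 * x[0],
                     2500.0 + 400.0 * x[1]])

def build_codebook(grid_points):
    """Exhaustively evaluate the forward map and sort the pairs by y."""
    xs = np.array(grid_points, dtype=float)
    ys = np.array([toy_forward(x) for x in xs])
    # lexsort uses the LAST key as primary: sort by F1, then F2, then F3.
    order = np.lexsort((ys[:, 2], ys[:, 1], ys[:, 0]))
    return xs[order], ys[order]

def lookup(codebook, y_target):
    """Return the stored x whose y is nearest to y_target (Euclidean)."""
    xs, ys = codebook
    dist = np.linalg.norm(ys - np.asarray(y_target, dtype=float), axis=1)
    return xs[np.argmin(dist)]

grid = [(a, b) for a in np.linspace(-1, 1, 21) for b in np.linspace(-1, 1, 21)]
codebook = build_codebook(grid)
x_hat = lookup(codebook, toy_forward([0.3, -0.4]))
```

With a real articulatory model the map is many-to-one, so several stored
x vectors may be nearly equidistant from the query; the sketch simply
returns the first one found.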

The  numerical  approach  provides  the  basis  for  the  vector 
quantization  of  the  articulatory  space,  presented  by  Larar  et 
al.  (1988).  The  initial  step  is  to  select  target 
configurations  for  the  vocal  tract  and  to  obtain  by 
interpolation the articulatory vectors x that span the
articulatory space and the corresponding acoustical vectors y
(LPC coefficients, for example). Then, the "training set"
formed by the pairs (x,y) is clustered, according to a
suitable measure of similarity for the acoustical vectors,
constituting the codebook. Larar et al. (1988) used a
modified k-means clustering algorithm and the likelihood ratio
for the definition of distance. Given an arbitrary acoustical
vector y, the closest acoustical vector of the codebook
(cluster "representant") will provide the linked articulatory
vector x.

These numerical approaches have, in general, the same
drawbacks as the previous methods, namely computational burden,
sensitivity to the source excitation, ambiguity of the
mapping, and limitations of the acoustical model. They
are suitable for deriving initial articulatory configurations.
However, their codebooks do not cover all the phonemes.

Schroeter  et  al.  (1990)  made  some  improvements  for  the 
generation  of  codebooks  of  the  articulatory-acoustic  pairs. 
The  nonuniqueness  of  the  mapping  was  attacked  by  using  a 
dynamic programming search. Two "important findings ... were
that the FFT-derived cepstral distance measures outperform
LPC-derived cepstral distances, and that cepstral liftering is
helpful in accommodating glottal variability" (Schroeter et
al., 1990, p. 393). Results on the performance of weighted
cepstral distortion for accessing articulatory codebooks are
given by Meyer et al. (1991).

Artificial neural networks (ANN) are a promising approach
to implement codebooks (Xue et al., 1990) due to their
properties of self-learning and memory association. Xue et
al.  chose  the  multi-layer  perceptron  as  the  supervised 
learning  model  and  used  a singular  value  decomposition  to 
reduce  the  redundant  hidden  units.  The  articulatory  vectors 
x (output) are the parameters of Mermelstein's model and the
associated acoustic vectors y (input) are LPC codes or formant
frequencies. Given y, the corresponding x is retrieved, since
the artificial neural network "learned" all of the pairs
(y,x). Two learning algorithms were tested: back-propagation
(BP) and random optimization (RM). The BP algorithm faced the
problem of being trapped by local minima and presented a heavy
computational cost (for a large number of input patterns).
The RM algorithm performed better for larger training patterns
but put "some constraints on the ranges of the weight values"
(Xue et al., 1990, p. 869). Xue et al. (1990) concluded that
the  present  capability  of  an  artificial  neural  network  is  only 
to  provide  initial  values  for  the  articulatory  parameters,  for 
a relatively  small  set  of  training  patterns. 

Rahim  and  Goodyear  (1990)  also  used  a multi-layer 
perceptron  for  vocal  tract  estimation.  Initial  vocal  tract 
configurations,  found  through  a 10th  order  covariance 
analysis,  were  optimized  using  gradient  search,  the  same 
scheme adopted by Levinson and Schmidt (1983). Then, the
resultant area functions were used as input data for a 4-layer
perceptron. It was reported that the final areas "were
generally within 20% of those of the targets" (p. 53). To
accommodate the ambiguity in the mapping, Rahim et al. (1991)
used an assembly of multi-layer perceptrons.

Kobayashi et al. (1991) used optimized cepstral parameters
as the input of a four-layer neural network. Dealing with the
articulatory dynamics provided by the mapping is reported to be
a "remaining problem."

3.2.3.4 Feedback methods

Flanagan et al. (1975) used the minimum squared error
between  the  synthesized  and  the  original  spectra  as  the 
distance  for  their  adaptation  scheme.  They  reported  that  the 
simultaneous  adaptation  of  all  parameters  (vocal  folds  and 
vocal  tract)  proved  to  be  unsatisfactory.  The  squared 
difference  between  the  logarithm  of  the  squared  magnitude  of 
the  synthesized  and  the  original  spectra,  summed  over  all 
frequencies,  was  chosen  as  the  objective  function.  A proper 
sequence  of  single-parameter  optimization  was  applied  to 
derive  favorable  initial  positions  (Flanagan  et  al.,  1980). 
The  Hooke  and  Jeeves  algorithm  was  used  for  the  final 
multiple-parameter  optimization.  The  articulatory  model  was 
the six-parameter model by Ishizaka (Fig. 1.13) and the mouth
area was measured optically (Polaroid photography and
planimetry). The time for adaptation (2 glottal parameters
and 4 tract parameters) for a 12.8 ms frame was reported as 26
min, using a CRAY I.
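
The log-spectral objective described above can be sketched as follows.
This is an illustration only: the natural logarithm, the small floor
that guards against log(0), and the function name are assumptions, and
the spectra are taken to be sampled on the same frequency grid.

```python
import numpy as np

def log_spectral_error(synth_spectrum, orig_spectrum, floor=1e-12):
    """Sum over frequency bins of (log|S_syn|^2 - log|S_orig|^2)^2."""
    syn = np.abs(np.asarray(synth_spectrum)) ** 2
    org = np.abs(np.asarray(orig_spectrum)) ** 2
    # The floor avoids taking the logarithm of zero-valued bins.
    diff = np.log(np.maximum(syn, floor)) - np.log(np.maximum(org, floor))
    return float(np.sum(diff ** 2))
```

A derivative-free search such as Hooke and Jeeves only needs this scalar
value for each trial set of parameters.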

Levinson and Schmidt (1983) used the same error function
as Flanagan's scheme, the optimal gradient algorithm by
Shapiro, and an articulatory model based on Coker's work (Fig.
1.9). According to the authors, the method "can perform quite
well but is not robust" (Levinson and Schmidt, 1983, p. 1154).
Only vowels and diphthongs were considered and the main
difficulties were reported to be the instability of the LPC
spectral estimates and a problem they called the "ventriloquist
effect." This effect occurs when one or more parameter
coordinates remain fixed while the others vary in an attempt
to compensate for the error function. This results in
physiologically possible but unrealistic configurations.

Parthasarathy and Coker (1990) also used the Hooke and
Jeeves algorithm for the optimization of articulatory
parameters derived from Coker's model. Three criteria were
adopted as the objective function: the same "spectral
difference" used by Flanagan et al. (1975), in the initial
stages of optimization; the "spectral-slope error" (reported
as insensitive to glottal effects), in the final stages; and
finally the phone "rate of transition." The phoneme
durations were previously optimized using a dynamic
time-warping algorithm.

Gupta and Schroeter (1991) also used the Hooke and
Jeeves algorithm for the optimization of the vocal tract
parameters (and of the glottal parameters). They used
"single-frame optimization" (adaptation of parameters once
every pitch period) and "multi-frame optimization" (once every
few pitch periods). Cepstral distance measures were reported to
perform better than log-spectral distance measures for the
optimization of the vocal tract.

3.2.3.5 Other methods

The method of Ladefoged et al. (1978) consists basically
of recovering vocal tract sagittal distances, using the
correlation between articulatory vectors (represented by two
tongue shape components and by the lip distance) and
acoustical vectors (represented by the first three formants).
It applies only to vowels and depends on the initial
availability of x-ray data for the description of the tongue
shapes. Perturbation techniques have been used in different
ways by several researchers: Mermelstein and Schroeder (1965),
Atal et al. (1978), Fant et al. (1988), and Lin (1990). Lin
used as components of the articulatory vector the constriction
area, its axial location and the ratio of the length of the
lip section to the area of this section (from Fant's
articulatory model). The initial values were selected by
empirical rules and the iteration was controlled by the
distance between the real and calculated formants, using a
number of sub-segments of this distance.

3.3 Inverse Mapping Approach
3.3.1  Choice  of  the  Method 

We  selected  the  optimization  method  by  weighing  the 
advantages  and  disadvantages  of  the  several  methods  discussed 
(Section 3.2) and by considering the resources available for
computation.

The  optimization  problem  can  be  stated  as  follows: 

Given the acoustic vector y, what is the articulatory
vector x such that the distance between y and y' = f(x) is
minimal?

An  optimum  seeking  procedure  that  minimizes,  iteratively, 
the  error  between  features  of  the  synthesized  and  original 
speech  is  needed.  The  routines  for  analysis  of  the  speech 
signal are available (Chapter 2). The smoothed formant
contour of sentences can be obtained from the analysis, or, in
a text-to-speech context, typical formant frequency and
duration values are available for all allophones (Klatt,
1987). The error obtained from the comparison between the two
sets of the first three formants, model-derived and
speech-derived, is suitable for the application of constrained
optimization techniques. The acoustic vector y is composed of
the first three formants (F1F, F2F, F3F) and the articulatory
vector x is composed of the coordinates of the jaw (TETJ),
tongue body (TOX,TOY), tongue tip (TX,TY), lips (HL,PPL), and
hyoid (HY). These parameters are described in Section 2.2.1. Our
implementation is able to handle both vowels and consonants.
The velum coordinates are initially those of the default
position (for a nasal, non-nasal or nasalized phoneme) but they
are also allowed to change in order to follow the small
variations of the velum (for example, the velum elevation is
different for different vowels). The dimensions of the lower
pharynx  are  also  allowed  to  change  whenever  a significant 
decrease  in  the  error  function  occurs. 

3.3.2  Strategy  for  the  Optimization  Procedure 

For  the  synthesis  of  sentences,  the  inverse  mapping 
approach is applied to targets (F1_k, F2_k, F3_k), k = 1 to N, on the
formant contour. The selection of the N targets is based on
the results of the analysis of the speech signal, including
location of word endpoints and phoneme boundaries. An
analysis frame of 256 samples (Hamming window), with an
overlap of 64 samples, is used in the analysis subroutines,
which include energy, zero crossing rate, spectral analysis,
and  fundamental  frequency.  The  area  functions  for  the  points 
in  between  the  targets  are  estimated  using  interpolation, 
with  negligible  distortion  in  the  synthetic  speech  (Shadle  and 
Atal,  1979).  Figure  3.1  illustrates  this  strategy  for  the 
synthesis  of  the  sentence  "We  were  away  a year  ago." 
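
A minimal sketch of the interpolation between targets is given below.
The text does not state which interpolation rule is used, so linear
interpolation is assumed here; the function name and the array layout
are likewise illustrative.

```python
import numpy as np

def interpolate_area_functions(target_times, target_areas, frame_times):
    """Linearly interpolate each tube section's area between targets.

    target_times : increasing times of the N targets
    target_areas : array of shape (N, n_sections), one area function per target
    frame_times  : times at which area functions are needed
    """
    target_areas = np.asarray(target_areas, dtype=float)
    columns = [np.interp(frame_times, target_times, target_areas[:, s])
               for s in range(target_areas.shape[1])]
    return np.stack(columns, axis=1)
```

For example, two targets at t = 0 and t = 1 with area functions [1, 2]
and [3, 6] yield [2, 4] at t = 0.5.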

Fig. 3.1 Targets for the optimization approach (formant
contour provided by the MMIRC formant
synthesizer, Pinto et al., 1989).

3.3.3 Description of the Optimization Method

This inverse mapping is stated as a "constrained
multidimensional nonlinear optimization problem." The design
vector is the articulatory vector x. The objective function
(or misfit function) e(F) is generated by the model-derived
and by the target formants, using a percent
least-absolute-value (l1-norm) error criterion:

    e(F) = \sum_{i=1}^{3} W_i \left| \frac{Fm_i - Ft_i}{Ft_i} \right| \times 100\%

where:  i:     1 to 3
        Fm_i:  model-derived formants
        Ft_i:  target formants
        W_i:   weights, W_2 > W_1 > W_3
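
The misfit function can be written compactly in Python, reading it as
the weighted sum of the relative absolute formant errors expressed in
percent. The weight values below are hypothetical; the text only states
their ordering (the F2 weight largest, the F3 weight smallest).

```python
def formant_error(model_formants, target_formants, weights=(0.3, 0.5, 0.2)):
    """Weighted percent l1 error between model and target formants.

    The default weights are placeholders that only respect W2 > W1 > W3.
    """
    return 100.0 * sum(
        w * abs((fm - ft) / ft)
        for w, fm, ft in zip(weights, model_formants, target_formants)
    )
```

With these placeholder weights, a 1% error in F1 alone gives an overall
error of about 0.3%.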

The constraints are not only the articulatory-to-acoustic
operator f but also the boundary conditions for the
articulatory parameters, as follows:

    y = f(x)
    LB_k < x_k < UB_k,    k = 1 to 8
    x_p < PB  if  x_q > QB

where:  x:              articulatory vector
        y:              acoustical vector
        LB_k and UB_k:  lower and upper absolute thresholds
                        for parameter k
        PB and QB:      relative thresholds between
                        parameters p and q
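
A feasibility test for these two kinds of constraints might look like
the sketch below. The tuple layout used for the relative rules and the
function name are assumptions made for illustration; the actual
threshold values are determined experimentally and are not reproduced
here.

```python
def is_feasible(x, lower, upper, relative_rules=()):
    """Check box bounds and conditional (relative) constraints.

    relative_rules: iterable of (p, PB, q, QB), meaning
    "x[p] < PB must hold whenever x[q] > QB".
    """
    in_box = all(lo < xi < hi for xi, lo, hi in zip(x, lower, upper))
    relative_ok = all(x[p] < pb
                      for p, pb, q, qb in relative_rules
                      if x[q] > qb)
    return in_box and relative_ok
```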

The  constraint  boundaries  are  experimentally  determined 
in  order  to  avoid  physiologically  impossible  configurations, 
e.g.,  tongue  tip  beyond  the  tract  walls,  constriction  area  too 
small (thereby capable of generating turbulent noise), point
PP (Fig. 2.1) inside the tongue body circle, etc. The forward
transformation f, which maps an articulatory vector into
an acoustic vector, and the inverse transformation g, which
maps an acoustic vector into an articulatory vector, are both
non-linear (Charpentier, 1984). The transformation f, such
that y = f(x), is unique, but the transformation g, such that
x = g(y), is not unique or may not exist (Atal et al., 1978),
when  the  articulatory  vector  has  more  components  than  the 
acoustic  vector.  Thus,  the  articulatory  dynamics  must  be 
considered  carefully,  since  during  the  optimization  process, 
more  than  one  "optimal"  articulatory  vector  may  be  derived 
from  the  same  acoustic  vector.  Depending  on  the  initial 
condition,  the  resulting  articulatory  vector  may  not  represent 
a physiologically  possible  configuration. 

Our  approach  consists  of  applying  successively  a 
constraint  approximation  and  a feasible  direction  method,  more 
specifically,  the  linear  successive  approximation  and  the 
gradient search techniques. Even though either of these
techniques can be used alone in our realization, with quite good
results, a sequence of steps through the linear approximation
and the gradient search allows a faster convergence, normally
with final errors less than 1%, for both vowels and
consonants. Normally, when one of the techniques is not
capable of further minimizing the error, the other one is able
to continue the optimum seeking process, until the desired
error is achieved. It is known, for example, that the
convergence of l1-norm criterion systems may be slow near the
global minimum and that, in addition, multiple and secondary
minima and saddle points may occur if the constraint
transformation (f, in our case) is non-linear. By
complementing the gradient search with a successive
approximation (different feasible directions), these drawbacks
are minimized and priorities can also be assigned to the
articulators to improve the articulatory dynamics.

Successive approximation. The linear successive
approximation is illustrated in Fig. 3.2. The articulatory
vector x (a function of three parameters, with the others
fixed) is represented in its initial position as
x_0 = (X1_0, X2_0, X3_0). The corresponding acoustic vector is
y_0 = (F1_0, F2_0, F3_0), determined by the articulatory model. The
target in the acoustic domain is y(t_i). If the acoustic
vector shift or perturbation Δy = y(t_i) - y_0 is small enough
(infinitesimal), the corresponding articulatory vector shift
Δx = x(t_i) - x_0 can be determined as long as the sensitivities
of the three formants with respect to the three articulatory
parameters, ∂Fk/∂X1, ∂Fk/∂X2, ∂Fk/∂X3, k = 1 to 3, are known.
The matrix of sensitivities [A], shown in Fig. 3.2, is
estimated by using backward differences. Then the linear
system of equations [A] Δx = Δy is solved. The vector x(t_i)
= x_0 + Δx is the desired target in the articulatory domain.
A linear approximation is obtained provided that, whenever any
component of the shift Δy (ΔF1, ΔF2, or ΔF3) is greater than
its threshold, a fractional value is assigned to this shift
component. Stated another way, intermediate targets between
the initial position and the final target are reached, using
piecewise linear segments.

    [A] = \begin{bmatrix}
            \partial F1/\partial X1 & \partial F1/\partial X2 & \partial F1/\partial X3 \\
            \partial F2/\partial X1 & \partial F2/\partial X2 & \partial F2/\partial X3 \\
            \partial F3/\partial X1 & \partial F3/\partial X2 & \partial F3/\partial X3
          \end{bmatrix}

    [A] \Delta x = \Delta y

    x(t_i) = x_0 + \Delta x

Figure 3.2 Linear successive approximation method.
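
The procedure can be sketched in Python as follows. This is a sketch
under stated assumptions, not the program described in this chapter:
`toy_forward` is a hypothetical smooth three-parameter map standing in
for the articulatory model, and the threshold (50 Hz) and the fraction
(0.5) applied to large shift components are illustrative values.

```python
import numpy as np

def toy_forward(x):
    """Hypothetical 3-parameter -> 3-formant map standing in for f."""
    x1, x2, x3 = x
    return np.array([
        500.0 + 200.0 * x1 + 30.0 * x2 * x2,
        1500.0 + 400.0 * x2 - 50.0 * x1 * x3,
        2500.0 + 300.0 * x3 + 20.0 * x1 * x1,
    ])

def sensitivity_matrix(f, x, h=1e-4):
    """Estimate [A], with entries dFk/dXj, by backward differences."""
    y = f(x)
    A = np.empty((3, 3))
    for j in range(3):
        x_back = np.array(x, dtype=float)
        x_back[j] -= h
        A[:, j] = (y - f(x_back)) / h
    return A

def successive_approximation(f, x0, y_target, threshold=50.0, fraction=0.5,
                             tol=1e-3, max_iter=100):
    """Approach y_target through intermediate targets (piecewise-linear)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        dy = y_target - f(x)
        if np.max(np.abs(dy)) < tol:
            break
        # Components above the threshold are replaced by a fraction of the shift.
        dy_step = np.where(np.abs(dy) > threshold, fraction * dy, dy)
        dx = np.linalg.solve(sensitivity_matrix(f, x), dy_step)
        x = x + dx
    return x

y_target = toy_forward([0.4, -0.3, 0.2])
x_found = successive_approximation(toy_forward, [0.0, 0.0, 0.0], y_target)
```

The sketch omits the parameter bounds and the cycling through different
sets of three parameters.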

A percent weighted average of the acoustic vector shift
components generates the error function e(x)_i for the
optimization scheme.

Since  only  three  articulatory  parameters  can  be  used  each 
time,  the  process  is  repeated  successively  for  the  other  sets 
of three parameters, using the updated values of the acoustic
shift Δy and of the articulatory vector x(t_i). Three sets
have proved to be appropriate for the optimization:

Set 1: jaw (X1=TETJ) and tongue body (X2=TOX, X3=TOY);

Set 2: lips (X1=HL) and tongue tip (X2=TX, X3=TY);

Set 3: hyoid (X1=HY) and velum (X2=VX, X3=VY).

The best sequence for the sets, in most of the cases, depends
on the position of the constriction in the vocal tract. For
vowels, sequence 1, "Set 1, Set 2, Set 3," is the best,
while for consonants, sequence 2, "Set 2, Set 1, Set 3," is
the most efficient. The range of values for the tongue tip in
"sequence  1"  is  around  the  default  position. 

In  order  to  avoid  physiologically  impossible 
configurations,  the  values  of  the  articulatory  parameters  are 
constrained to specific ranges. It may happen that the
expected value of one or more parameters is close to being
out of bounds after the operation x(t_i) = x_0 + Δx. In this
case, the value of the particular threshold is assigned to the
parameter. After that, if the error function increases, the
articulatory vector returns to its previous position.

Gradient. Two schemes have been developed to provide
flexibility to the approach. The first scheme uses an
8-dimensional articulatory vector and the second one uses
sequentially a 6-dimensional, a 5-dimensional and a
2-dimensional articulatory vector.

A special  algorithm,  shown  in  Fig.  3.3,  is  applied  to 
accelerate  the  convergence  of  this  optimization.  The  gradient 
is  estimated  by  using  backward  differences.  For  the 
iterations  3,  6 and  9,  the  search  direction  is  similar  to  that 
used  in  the  Fletcher-Reeves  Method  (Rao,  1984),  which  depends 
on  the  current  and  previous  gradients.  For  the  other 
iterations  a normalized  steepest  descent  direction  is  used. 
Whenever the error e(x)_i (i: iteration reference number) is
greater than its previous value e(x)_{i-1}, the step length is
corrected and the articulatory vector returns to its previous
position. The optimum step size is estimated by using
interpolation (Hamming, 1971). Whenever the articulatory
vector lies in an infeasible region, the threshold values are
assigned to the parameters that have crossed the boundaries.
The iteration continues until one of the following conditions
is met:

Error D = e(x)_i < 1%;

Relative error E = |e(x)_i - e(x)_{i-1}| / e(x)_i < 0.001;

Number of iterations NLOO > 10.
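
The overall flow of the gradient scheme can be sketched as below. Several
simplifications are assumed and should not be read as the algorithm of
Fig. 3.3: the step-size interpolation is replaced by simple step halving
after a rejected step, the forward map `toy_forward` and the error
weights are hypothetical, and the relative-error test is applied only
after an accepted step.

```python
import numpy as np

def toy_forward(x):
    """Hypothetical 3-parameter -> 3-formant map standing in for f."""
    x1, x2, x3 = x
    return np.array([500.0 + 200.0 * x1 + 30.0 * x2 * x2,
                     1500.0 + 400.0 * x2 - 50.0 * x1 * x3,
                     2500.0 + 300.0 * x3 + 20.0 * x1 * x1])

def percent_error(y_model, y_target, weights=(0.3, 0.5, 0.2)):
    """Weighted percent l1 error (placeholder weights)."""
    rel = np.abs((y_model - y_target) / y_target)
    return 100.0 * float(np.sum(np.asarray(weights) * rel))

def backward_gradient(e, x, h=1e-4):
    """Gradient of the scalar error e estimated by backward differences."""
    g = np.empty_like(x)
    e0 = e(x)
    for j in range(x.size):
        xb = x.copy()
        xb[j] -= h
        g[j] = (e0 - e(xb)) / h
    return g

def gradient_search(f, x0, y_target, lower, upper, step=0.1, max_iter=10):
    e = lambda x: percent_error(f(x), y_target)
    x = np.clip(np.asarray(x0, dtype=float), lower, upper)
    err = e(x)
    g_prev = d_prev = None
    for it in range(1, max_iter + 1):
        if err < 1.0:                      # error below 1%
            break
        g = backward_gradient(e, x)
        if np.linalg.norm(g) == 0.0:
            break
        if it in (3, 6, 9) and g_prev is not None:
            beta = g.dot(g) / g_prev.dot(g_prev)   # Fletcher-Reeves factor
            d_raw = -g + beta * d_prev
        else:
            d_raw = -g                             # steepest descent
        norm = np.linalg.norm(d_raw)
        if norm == 0.0:
            break
        # Parameters that cross a boundary are set to the threshold value.
        x_new = np.clip(x + step * d_raw / norm, lower, upper)
        err_new = e(x_new)
        if err_new > err:
            step *= 0.5                    # reject: keep previous position
        else:
            rel = abs(err_new - err) / err_new if err_new > 0 else 0.0
            x, err = x_new, err_new
            if rel < 0.001:                # relative error criterion
                break
        g_prev, d_prev = g, d_raw
    return x, err

lower, upper = -np.ones(3), np.ones(3)
y_target = toy_forward([0.4, -0.3, 0.2])
err_initial = percent_error(toy_forward(np.zeros(3)), y_target)
x_best, err_final = gradient_search(toy_forward, np.zeros(3), y_target, lower, upper)
```

Because a step is accepted only when the error does not increase, the
returned error never exceeds the initial one, although ten iterations
may not be enough to reach the 1% goal on every problem.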

Figure 3.3 Approach for the gradient algorithm.

The 8-dimensional articulatory vector "arv8" contains
the jaw (TETJ), tongue tip (TX,TY), tongue body (TOX,TOY),
lip (HL,PPL), and hyoid (HY) parameters (described in Section
2.2.1). In the optimization process, the tongue and the jaw
are taken as independent articulators, because the tongue can
also move with respect to the maxilla, while the jaw is kept
fixed (Scully, 1987). However, the relationship between the
tongue-body center and the jaw angle (from Mermelstein's
model) is used to establish their initial values. The default
values for the tongue tip and hyoid are also used as initial
positions.

The  5-dimensional  vector  "arv5"  is  related  to  the  front 
cavity  of  the  vocal  tract  and  contains  the  lip  (HL,PPL),  jaw 
(TETJ)  and  tongue  tip  (TX, TY)  parameters.  The  2-dimensional 
vector  "arv2"  contains  only  the  tongue  body  (TOX,TOY) 
parameters.  It  is  used  to  find  appropriate  start  positions 
for  the  tongue.  For  vowels  (Section  2.2.1)  the  tongue  tip 
position  can  be  estimated  from  the  tongue-body  coordinates. 
This  relationship  is  kept  when  the  "arv2"  vector  is  used.  The 
6-dimensional  vector  "arv6"  is  related  to  the  back  cavity  of 
the  vocal  tract  and  contains  the  hyoid  parameter  and  the  five 
dimensions of the lower part of the pharynx (G2-G, G2-L, W-G2,
W1-H and G1-K in Fig. 2.1), which is normally considered fixed
in most articulatory models. The area function in the lower
pharynx is not the same for different phonemes and for
different pitch phonations (Boone, 1971; Johansson et al.,
1983). Consequently, fine adjustments in the lower pharynx
are necessary to obtain smaller final errors. Two examples to
illustrate this characteristic are:

(1) The second formant F2 of the vowel i (Swedish) "is
more sensitive to a variation of the pharyngeal width than to
the degree of palatal constriction" (Wood, 1986);

(2) Lowering the larynx lengthens and widens the lower
pharynx (Wood, 1986).

In addition, since the pharynx length varies according to
age and sex (Traunmuller, 1984), "arv6" is useful in the
modeling  of  female  and  child  voices.  The  application  of  the 
gradient  approach  using  the  "arv2,"  "arv5"  and  the  "arv6" 
vectors,  in  a proper  sequence  (according  to  the  position  of 
the  constriction)  can  often  lead  to  good  results. 

3.3.4  Implementation  of  the  Optimization  Method 

A menu-driven  program  with  graphic  capability  implements 
the optimization approach summarized in Section 3.3.3.
Options  for  the  combination  of  the  successive  approximation 
and  gradient  approaches  are  available.  The  whole  menu 
consists  of  four  parts: 

Part  1:  Interactive  Articulatory  Model; 

Part  2:  Automatic  Multidimensional  Optimization; 

Part  3:  Univariate  Optimization; 

Part 4: Utilities.

Part 1. This part implements the macros described in
Section  2.2.3.  Our  optimization  scheme  can  start  with  any 
vocal  tract  configuration.  However,  it  is  advantageous  to  use 
Part  1 as  a kind  of  "pilot"  for  the  optimization  scheme.  If 
some  information  about  the  phoneme  is  available,  for  example, 
the  features  listed  in  Table  1.1  or  sketches  of  vocal  tract 
configurations (Mermelstein, 1972; Harchman et al., 1977;
Ladefoged  et  al.,  1978;  Levinson  and  Schmidt,  1983;  Scully, 
1984,  etc.),  this  phase  can  be  very  useful,  not  only  to 
improve  the  convergence  of  the  minimization,  but  also  to  avoid 
configurations  whose  articulatory  dynamics  are  not  realistic. 
The  macros  of  Part  1 provide  good  estimates  for  the  initial 
positions  of  the  articulators,  mainly  for  the  tongue-body 
center  (default  value  for  vowels)  and  tongue  tip  (constriction 
of consonants). All the initial values of the articulatory
parameters are written into the file "arvec.in". The value of
a particular  parameter  can  be  modified  and  the  sagittal  lines 
can  be  checked  during  the  intervals  of  the  optimization  phase. 

Part 2. Part 2 (Automatic Multidimensional Optimization)
includes the macros for the successive approximation and
gradient searches, for the drawings and tables on the screen,
and for the input/output. Figure 3.4 shows the input/output
of the process (input/output files written in lower case).

Fig. 3.4 Input/output in the optimization method.

After each stage of optimization, the percent absolute
error for each formant is calculated and compared with the
corresponding JND (Just Noticeable Difference) proposed by
Ghitza and Goldstein (1986) and quantified by Parthasarathy
and Coker (1990). If the individual errors for F1, F2 and F3
are within the range in which a change in the formant
frequency value causes no perceptual effect (Error Fi < JND Fi,
i = 1 to 3), a "Y" will be written onto the screen, in a
table; otherwise, an "N" will be written.
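
The Y/N test described above can be sketched as follows. This is a minimal illustration, not the macro code itself: the function name, the choice of the target formant as the reference for the percent error, and the JND thresholds used in the example are all assumptions.

```python
def jnd_table(model_formants, target_formants, jnd_percent):
    """Return one (percent_error, flag) pair per formant.

    flag is "Y" when the percent absolute error is below the JND
    for that formant and "N" otherwise.
    """
    rows = []
    for fm, ft, jnd in zip(model_formants, target_formants, jnd_percent):
        # Percent absolute error, taken relative to the target (assumption).
        err = 100.0 * abs(fm - ft) / ft
        rows.append((err, "Y" if err < jnd else "N"))
    return rows


# Hypothetical formants (Hz) and JND thresholds (percent), for illustration.
rows = jnd_table([505.0, 1500.0, 2600.0],
                 [500.0, 1500.0, 2400.0],
                 [3.0, 3.0, 3.0])
for index, (err, flag) in enumerate(rows, start=1):
    print(f"F{index}: error = {err:.2f}%  {flag}")
```

With these made-up numbers only the third formant falls outside its JND.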

Three macros perform the initialization of the process:
[ConsArea], [IM: Init] and [AcousVec]. The macro [ConsArea]
provides the coordinates for the endpoints in the fixed
structure (Section 2.2.3 and Fig. 2.3), determines some points
for the sagittal grid (on the hard palate) and estimates the
area of the lower pharynx (up to the point H, the hyoid
position). The macro [IM: Init] implements a whole
articulatory model for the initial values of the parameters:
it reads in the initial articulatory vector x0 (file arvec.in;
Fig. 3.4, switch S1), locates all the articulators, determines
the sagittal distances and the corresponding cross-sectional
areas, estimates the first four formant frequencies (initial
acoustical vector y0), and draws on the screen the vocal tract
outline, the area function and the tables with the values of
the parameters and formants (Fig. 3.5). Each of the main
regions in terms of sagittal distance-to-area conversion
(pharynx, soft palate, hard palate, region between alveolar
ridge and incisors, and labial region) and its corresponding
cross-sectional areas are displayed with the same color. The
researcher is prompted for the type of phoneme (vowel or
consonant) and for the position of the velum (the default for
"nasal," "nasalized" and "non-nasal" or the coordinate values
in the input file arvec.in). The macro [AcousVec] reads in
the target acoustical vector y(ti) (file acvec.in; Fig. 3.4,
switch S2), determines the error function e(xt), writes it on
the screen (Fig. 3.4) and sets up some values that are used in
the algorithms (the initial step lengths, increments, etc.).
The JND's for the target vector are calculated. The
optimization procedure is the next phase.

Fig. 3.5 Initial configuration for the optimization scheme.

Any sequence of operations can be chosen. After each
iteration the current vocal tract outline, acoustic vector,
and error are written onto the screen, using a different color
for each iteration. The gradient algorithm is implemented by
the macro [Gradie8], or by the three macros [Grad:LJT],
[Grad:Tb] and [Grad:HFi]. The macro [Gradie8] automatically
implements the gradient algorithm using the 8-dimensional
vector "arv8," normally with an error less than 5% after the
tenth iteration. If the procedure ends without the occurrence
of the "relative error condition," the macro can be applied
again to obtain smaller errors (the last value of the step
length is kept for the first iteration, unless the macro
[AcousVec] is used again). The macros [Grad:LJT], [Grad:Tb],
and [Grad:HFi] use the 6-dimensional articulatory vector
"arv6," the 2-dimensional articulatory vector "arv2" and the
5-dimensional articulatory vector "arv5," respectively
(described in Section 3.3.3). They have almost the same
structure as [Gradie8] and can be sequentially applied in
order to obtain, sometimes with more speed, errors less than
3%.
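
A gradient search of this general kind can be sketched as below. The sketch is plain steepest descent with a finite-difference gradient and step halving, not the Fletcher-Reeves-inspired algorithm of the macros. The linear forward model `toy_formants`, the stopping tolerance, and all numerical values are hypothetical stand-ins for the articulatory-to-formant computation.

```python
import numpy as np

# Hypothetical linear stand-in for the articulatory-to-formant model.
A = np.array([[300.0, -50.0], [-200.0, 400.0], [100.0, 150.0]])
B = np.array([500.0, 1500.0, 2500.0])


def toy_formants(x):
    """Map a 2-dimensional 'articulatory' vector to three formants (Hz)."""
    return A @ x + B


def error(x, target):
    """Sum of the percent absolute errors of the three formants."""
    return float(np.sum(100.0 * np.abs(toy_formants(x) - target) / target))


def gradient_search(x0, target, step=0.5, max_iter=10, rel_tol=1e-3, h=1e-4):
    """Steepest descent; returns the vector, its error and the last step."""
    x = np.asarray(x0, dtype=float)
    e = error(x, target)
    for _ in range(max_iter):
        # Forward-difference estimate of the gradient of the error.
        grad = np.array([(error(x + h * u, target) - e) / h
                         for u in np.eye(len(x))])
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        trial = x - step * grad / norm
        e_trial = error(trial, target)
        if e_trial < e:
            rel_change = (e - e_trial) / e
            x, e = trial, e_trial
            if rel_change < rel_tol:  # assumed "relative error condition"
                break
        else:
            step *= 0.5  # overshoot: shorten the step and try again
    return x, e, step


target = toy_formants(np.array([0.3, 0.2]))
x_start = np.zeros(2)
e_start = error(x_start, target)
x_final, e_final, last_step = gradient_search(x_start, target)
print(e_start, e_final, last_step)
```

Returning the last step length mirrors the remark that it is kept when the search is applied again.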

The dimensions of the lower pharynx are written into the file
"fixpar.dat."

Two schemes are available for the successive
approximation method. The first scheme uses the macros
[Pert:JTb], [Pert:LTt], and [Pert:HV] for Sets 1, 2, or 3
(presented in Section 3.3.3), respectively. Again, any
sequence can be used, with final errors about the same as
those obtained for the gradient case. The macro [Pert:LTt]
establishes the boundaries for the tongue tip, distinguishing
between the vowel and the consonant cases. This scheme is
included in Menu 1. The second scheme (Menu 2) links the
sequence "Set 1, Set 2, Set 3" in the macro [Pert:Vow],
appropriate for vowels, and the sequence "Set 2, Set
1, Set 3" in the macro [Pert:Con], the most suitable for
consonants. The process stops after ten iterations or if the
error function for each of the three sets of the sequence has
not decreased twice.
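
The linked sequences can be illustrated with a simple coordinate-wise perturbation loop. This is only a sketch under stated assumptions: the parameter indices assigned to each set, the increment, the halving of the increment after a pass without improvement, and the toy error function are hypothetical. Only the ten-iteration limit, the stop after two passes without a decrease, and the vowel ordering "Set 1, Set 2, Set 3" come from the text.

```python
def successive_approximation(x0, error_fn, sets, delta=0.05, max_iter=10):
    """Perturb one parameter at a time, visiting the sets in order."""
    x = list(x0)
    e = error_fn(x)
    stalls = 0
    for _ in range(max_iter):
        improved = False
        for indices in sets:
            for i in indices:
                for d in (delta, -delta):
                    trial = list(x)
                    trial[i] += d
                    e_trial = error_fn(trial)
                    if e_trial < e:
                        x, e = trial, e_trial
                        improved = True
                        break
        if improved:
            stalls = 0
        else:
            stalls += 1
            if stalls == 2:  # no set decreased the error, twice in a row
                break
            delta *= 0.5  # assumption: refine the increment after a stall
    return x, e


# Toy error with a known minimum at (0.2, -0.1, 0.05).
def toy_error(x):
    return abs(x[0] - 0.2) + abs(x[1] + 0.1) + abs(x[2] - 0.05)


vowel_sequence = [[0], [1], [2]]  # stands for "Set 1, Set 2, Set 3"
x_opt, e_opt = successive_approximation([0.0, 0.0, 0.0], toy_error,
                                        vowel_sequence)
print(x_opt, e_opt)
```

For consonants the same function would be called with the sets reordered as "Set 2, Set 1, Set 3".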

Part 2 includes three more macros: [DrawArVT],
[WriteVec], and [Codebook]. The macro [DrawArVT] erases the
successive vocal tract outlines drawn during the optimization
phase and draws the updated area function and the vocal tract
outline. The macro [WriteVec] writes the updated articulatory
vector into the file "arvec.in" and onto the screen, and
writes the dimensions of the pharynx into the file
"fixpar.dat," whenever desired. The macro [Codebook]
constructs the codebook (switch S3 in Fig. 3.5). Two files
are generated: "artpar.dat," which contains the optimized
articulatory vector, the dimensions of the "fixed structure,"
the formant targets and the value of the error, and the file
"areafun.mat," with the equal-length-section area function (60
cross-sectional areas) and the length of the section. They
are renamed for each mapped phoneme. To retrieve the data
from the codebook, the macro [loadcode], in the tree of the
macro [AS: Util], can be used. The file "areafun.mat" is fed
to the subroutine "TRACT" in the articulatory synthesizer.
Figure 3.6 shows the matching between the area function and
its equal-length-section interpolated version.
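
One way to obtain an equal-length-section version of an area function is sketched below. It assumes linear interpolation between the centers of the original sections, which may differ from the interpolation used by [Codebook]; the section lengths and areas in the example are invented.

```python
import numpy as np


def equal_length_sections(lengths, areas, n_sections=60):
    """Resample an area function onto n_sections sections of equal length.

    lengths, areas: length (cm) and cross-sectional area (cm^2) of each
    original section, ordered from glottis to lips.
    Returns the resampled areas and the common section length.
    """
    lengths = np.asarray(lengths, dtype=float)
    areas = np.asarray(areas, dtype=float)
    total_length = lengths.sum()
    # Position of the center of each original section along the tract.
    centers = np.cumsum(lengths) - lengths / 2.0
    section_length = total_length / n_sections
    new_centers = (np.arange(n_sections) + 0.5) * section_length
    # Linear interpolation; values beyond the end centers are held constant.
    new_areas = np.interp(new_centers, centers, areas)
    return new_areas, section_length


# Hypothetical five-section tract, 17 cm long.
new_areas, section_length = equal_length_sections(
    [1.0, 2.0, 3.0, 4.0, 7.0], [0.5, 2.0, 4.0, 1.0, 3.0])
print(len(new_areas), section_length)
```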

Fig. 3.6 Area function and its equal-length-section version.

Part 3. Part 3 (Univariate Optimization) contains a
graphic tool to display the behavior of the error function due
to the variation of only one articulatory parameter and a
number of macros to apply the gradient algorithm on a
one-parameter basis. This part is not normally used in the
inverse mapping process, but can provide further improvement
to the results (errors near 0%). The macro [PLOTINIT]
initializes the graphic tool. The macro [PLOTSEN] prompts the
researcher for the name of the articulatory parameter to be
incremented and for the value of its increment. Then, the
vocal tract configurations, the corresponding formant contour
and the error function due to the variation of the parameter
are plotted. Figure 3.7 is an example of the tongue-body
variation case. Values on the formant contour can be obtained
using the cross-hair cursor.

For each parameter and dimension of the vocal tract,
there is a routine to perform the unidimensional optimization
using a gradient. These routines are also concatenated and
assigned to the macro [Chain].
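
The one-parameter view of the error function can be illustrated with a short sweep. This is a sketch only: the function name, the symmetric range of the sweep, and the toy error function are assumptions, and the graphic display is replaced by a list of (value, error) pairs.

```python
def sweep_parameter(x, index, increment, n_steps, error_fn):
    """Vary one parameter around its current value and record the error."""
    results = []
    for k in range(-n_steps, n_steps + 1):
        trial = list(x)
        trial[index] += k * increment
        results.append((trial[index], error_fn(trial)))
    return results


# Toy error with its minimum at x[0] = 1.0, x[1] = 0.0.
def toy_error(x):
    return abs(x[0] - 1.0) + abs(x[1])


results = sweep_parameter([0.5, 0.0], index=0, increment=0.25,
                          n_steps=4, error_fn=toy_error)
best_value, best_error = min(results, key=lambda pair: pair[1])
print(best_value, best_error)
```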

Part 4. Part 4 (Utilities) contains all the macros used
as CAD aids: zoom, erase, block, help, draw, colors, etc.

3.3.5  Results  and  Discussion 

This approach to the implementation of the inverse
mapping proved to be efficient and very flexible in dealing
with problems that are inherent to the
acoustic-to-articulatory transformation.

The first problem is related to the chain of probable
errors in the system. Mermelstein's model generally provides
a good match between x-ray tracing and vocal-tract outline,
but there is not enough information for a robust
representation of the lower part of the pharynx, for the
region between the tongue tip and jaw, and for consonants, in
general. Our approach optimizes the lower part of the
pharynx and uses a different representation for the tongue
tip-jaw region (Sections 3.3.3 and 2.2.3). More significant
than the deviations on the outline is the uncertainty in the
sagittal distance-to-cross-sectional area transformation.
Different empirical conversion formulas have been used by
researchers (Mermelstein, 1971; Lindblom and Sundberg, 1971;
Flanagan et al., 1980). The next deviation of the chain is
due to the area function-to-formant conversion. Finally, the
formant contour derived from the formant synthesizer may be
another source of error. Mermelstein (1971, p. 1076) obtained
average absolute errors of 10.3%, 4.9% and 5.5% (for F1, F2
and F3, respectively), when comparing the model-derived and the
speech-derived formant frequencies. The vocal tract and
articulator dimensions vary across different people
(inter-speaker variability) and for the same person in
different situations (intra-speaker variability). Therefore,
the final error function, generally under 2% (with variations
in the lower pharynx and under the condition of proper
articulatory dynamics), seems to be adequate, considering the
possible deviations throughout the system. The key to this
accomplishment is the appropriate combination of the
gradient (unidimensional and multidimensional optimization)
and successive approximation, together with the availability
of a flexible graphic editor.

The second and more severe problem arises from the
difficulty in tracking the proper articulatory dynamics. The
acoustic-to-articulatory mapping is not unique (Sondhi, 1979;
Atal et al., 1978) when the dimension of the articulatory
domain is greater than that of the acoustic domain.
Therefore, physiologically possible but dynamically unnatural
configurations may be obtained, depending on the initial
configuration. Moreover, the "ventriloquist effect" may occur
(Levinson and Schmidt, 1983, p. 1153): during the minimization
process, one or more parameters may stay fixed on the
boundaries of a feasible region, while the others change to
compensate for the error. This is not unexpected, since human
beings are endowed with the power of compensatory
articulation.

The  following  measures  were  adopted  to  attack  the 
articulatory  dynamics  problem: 

(1) To consider the vocal tract losses in the
area-to-formant conversion (Atal et al., 1978; Charpentier,
1984; Wright and Elliot, 1990).

(2) To begin with the interactive articulatory model
described in Part 1, Section 3.3.4, and in Chap. 2, in order
to set up an initial configuration as close as possible to the
expected optimal configuration. This is particularly
important for the tongue body and tongue tip, which determine
the location of the constriction for vowels and consonants,
respectively.

(3)  To  delimit  the  range  of  the  parameter  values. 

(4)  To  use  an  appropriate  sequence  for  the  parameter 
variation  (in  the  successive  approximation  technique)  and  to 
impose  different  rates  of  variation  on  some  articulators.  For 
example,  the  tongue  body,  in  the  case  of  vowels,  is  the  first 
to  reach  its  final  desired  position. 


(5)  To  allow  an  increase  in  the  error  if  an  improvement 
in  the  articulatory  dynamics  can  be  obtained. 

(6) To assist the process by checking the configurations
against the available data from researchers.
Information about the fourth formant, the bandwidths, the
constriction, vocal tract length, mouth opening, etc. (Atal et
al., 1978) is sometimes helpful but cannot be used in a
systematic way.

(7)  To  maintain  the  consistency  during  the  process  of 
optimization.  Compensatory  articulation  does  exist,  that  is, 
"different  people  can  produce  the  same  sound  with  different 
vocal  tract  shapes"  (Atal  et  al.,  1978,  p.  1555).  By  keeping 
the  consistency  of  the  procedure,  we  are  trying  to  select  a 
vocal  tract  dynamics  associated  with  one  hypothetical  person 
who  could  have  emitted  the  utterance. 

Our implementation does provide handy and flexible tools
for solving these problems, but cannot be considered
completely independent of the researcher's assistance. The
results are good in that the gestures we obtained nearly
matched those of the subject who emitted the utterance. The
final and important point to be considered is the validation
of the results through a listening assessment.

Figures 3.8 to 3.12 show the vocal tract area function,
the outline of the midsagittal plane, and the articulatory and
acoustic vectors for selected phonemes.

Figures 3.13 to 3.18 show the vocal tract area function,
the outline of the midsagittal plane, and the articulatory and
acoustic vectors for targets on the formant contour of the
sentence "We were away a year ago."

Figures 3.19 to 3.23 show the synthetic speech for
selected phonemes, using our database of vocal tract areas and
the volume-velocity excitations obtained using speech inverse
filtering.

Figure 3.24 shows the preliminary results of a
text-to-speech synthesis for the sentences "We were away a
year ago" (in American English) and "Eu dei a banana" (in
Portuguese), using empirical phonological rules.

Articulatory vectors, vocal tract area functions
and outlines for the phonemes EY and IH.

Figure 3.13

Figure 3.14 Articulatory vectors, vocal tract area functions
and outlines for the targets W4355 to W6756.

Figure 3.15 Articulatory vectors, vocal tract area functions
and outlines for the targets W7201 to W9488.

Figure 3.16 Articulatory vectors, vocal tract area functions
and outlines for the targets W9851 to W11584.

Figure 3.17 Articulatory vectors, vocal tract area functions
and outlines for the targets W11827 to W13239.

Figure 3.18 Articulatory vectors, vocal tract area functions
and outlines for the targets W14114 to W15385.

Figure 3.19 Synthetic speech and volume-velocity excitation
for the phonemes: a) AX b) AE.

Figure 3.20 Synthetic speech and volume-velocity excitation
for the phonemes: a) EH b) ER.

Figure 3.21 Synthetic speech and volume-velocity excitation
for the phonemes: a) EY b) IH.

Figure 3.22 Synthetic speech and volume-velocity excitation
for the phonemes: a) IY b) OW.

Figure 3.23 Synthetic speech and volume-velocity excitation
for the phoneme UW.

CHAPTER 4

VALIDATION AND EXPERIMENTS

4.1 Introduction

This chapter addresses the validation of the
articulatory synthesizer through experiments on modeling
voice disorders and female voices.

The field of voice disorders is a science, embracing the
clinical systems for "recognition, diagnosis, and treatment of
patients with voice impairment and laryngeal pathology"
(National Institute on Deafness and Other Communication
Disorders, 1989). Speech and voice systems, however, are
strongly interrelated. Voice disorders are caused mainly by
impairments in the vocal fold vibration (spasmodic dysphonia,
vocal tremor, and laryngospasm), by improper kinematic
articulation (mandibular restriction, inappropriate velar
posturing, tongue malpositioning, etc.) and by abnormalities
in the tracts (cleft palate, blockages, etc.). Some of the
major investigation topics in the National Strategic Research
Plan (1990) that are concerned with speech systems are:

(1) Regulation of pitch, loudness, register, and vocal
quality (effects of changes in configuration and dimensions of
the vocal folds).

(2) Acoustical (and biomechanical) source-tract
interaction.

(3) Study of voice production using acoustic aids (and
others) and methods for voice measurement with acoustic
measures (and others).

(4) Characterization of the specific disorders.

Under these basic guidelines, the Mind-Machine Interaction
Research Center (MMIRC) has been investigating the statistical
correlation between glottal factors and various vocal
characteristics (modal, creaky, breathy, rough, hoarse) with
significant results (Childers, 1987; Childers et al., 1987;
Childers, 1988; Childers et al., 1988; Lee and Childers, 1989;
Eskenazi et al., 1990; Lalwani and Childers, 1991). Several
experiments have been conducted to evaluate the effect of the
glottal factors on the simulation of vocal disorders. The
final objective is to quantify and classify the vocal
disorders caused by problems in the glottis and in the vocal
and nasal tracts (Childers, 1988). Figure 4.1 summarizes the
perceptual correlates of abnormal voices, showing the extreme
opposite conditions with respect to a normal voice. Table 4.1
and Table 4.2, extracted from MMIRC reports, summarize the
definitions of the various acoustic measures and the features
related to the source for various voice types.

Male-to-female voice conversion has also been addressed
by MMIRC, aimed at finding the acoustic correlates of gender
(Childers et al., 1985; Childers and Wu, 1990; Childers et
al., 1988).

TABLE 4.1 DEFINITIONS OF VARIOUS ACOUSTIC MEASURES

ACOUSTIC MEASURE   DEFINITION

APQ                Amplitude Perturbation Quotient: measures
                   amplitude perturbations from a smoothed
                   trend line.

DPF                Directional Perturbation Factor: percentage
                   of the total number of differences between
                   adjacent cycle durations for which there is
                   a change in algebraic sign.

EX                 Coefficient of Excess: correlated to the
                   signal-to-noise ratio of the residue signal.

HNR                Harmonics-to-Noise Ratio: ratio of acoustic
                   energy of the stable harmonic to that of the
                   noise.

Mean Jitter        Average of cycle-to-cycle pitch
                   perturbations.

PA                 Pitch Amplitude: amplitude of main peak of
                   autocorrelation of residue signal (measures
                   degree of voicing).

% JIT              Percent Jitter: ratio of mean jitter in ms
                   to mean period in ms, multiplied by 100.

PPF                Percentage of differences between the
                   duration of adjacent cycles that exceed
                   0.5 ms.

PPQ                Pitch Perturbation Quotient: measures pitch
                   perturbations from a smoothed trend line.

SFF                Spectral Flatness of Inverse Filter: measures
                   the masking of formant frequency amplitudes
                   and bandwidths by noise.

SFR                Spectral Flatness of Residue Signal: measures
                   the masking of fundamental frequency harmonic
                   amplitudes by noise.

SHIMMER            Average of cycle-to-cycle amplitude
                   perturbations.

Source: Lee and Childers (1989).

TABLE 4.2 CHARACTERISTICS OF GLOTTAL FACTORS

Parameter                  Modal       Creaky      Breathy     Rough  Hoarse

Fundamental Frequency      M           L           M           M      M
Pulse Width                M           Short       Long        M      M
Pulse Skewness             M           H           L           M      M
Abruptness of Closure      M           F           S           M      M
Aspiration Noise           M           L           H           M      H
Jitter                     L           L           L           H      H
Shimmer                    L           L           L           H      H
Harmonic Richness Factor   M           H           L           -      -
Harmonic to Noise Ratio    H           H           M           -      -
Spectral Tilt              M           L           H           -      -
Spectral Slope (dB/oct)    -12 to -18  -6 to -12   -12 to -18  -      -

Source: Lee and Childers (1989).

H: high   L: low   M: medium   S: slow   F: fast

Figure 4.1 Perceptual Correlates of Abnormal Voices.
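
Several of the perturbation measures defined in Table 4.1 follow directly from the sequence of cycle durations and peak amplitudes. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the percentages use the total number of adjacent-cycle differences as the denominator, and the input values are invented rather than measured.

```python
def perturbation_measures(periods_ms, amplitudes):
    """Mean jitter, percent jitter, shimmer, DPF and PPF for one utterance.

    periods_ms: duration of each pitch cycle in ms.
    amplitudes: peak amplitude of each pitch cycle (arbitrary units).
    """
    diffs = [b - a for a, b in zip(periods_ms, periods_ms[1:])]
    mean_jitter = sum(abs(d) for d in diffs) / len(diffs)
    mean_period = sum(periods_ms) / len(periods_ms)
    amp_diffs = [abs(b - a) for a, b in zip(amplitudes, amplitudes[1:])]
    # A sign change is counted when consecutive differences have
    # opposite algebraic signs.
    sign_changes = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return {
        "mean_jitter_ms": mean_jitter,
        "percent_jitter": 100.0 * mean_jitter / mean_period,
        "shimmer": sum(amp_diffs) / len(amp_diffs),
        "dpf_percent": 100.0 * sign_changes / len(diffs),
        "ppf_percent": 100.0 * sum(1 for d in diffs if abs(d) > 0.5)
                       / len(diffs),
    }


# Hypothetical cycle durations (ms) and peak amplitudes.
measures = perturbation_measures([8.0, 8.5, 8.0, 9.0, 8.5],
                                 [1.0, 0.9, 1.1, 1.0, 1.0])
print(measures)
```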

Our goal is to make it evident that the articulatory
synthesizer can provide some features that are not available
in the formant and LPC synthesizers.

Only the articulatory vocal-tract model can simulate the
disorders associated with improper articulation, tumors, and
cleft palate. Changes in the length and in cross-sectional
area for simulating female and child voices are easily made.
For the vocal fold movement, our glottal excitation model can
provide, in addition to the parameters available in the
formant and LPC synthesizers, the dimensions and weights of
the vocal folds. Figure 4.2 illustrates the scheme for the
conversion of voices: the effect of scaling the male-voice
values and parameters is assessed on a one-by-one basis (one
switch contact is closed each time) and via a combination of
changes (subset of closed switch contacts). The block "Tests"
refers to simple experiments on rough and breathy voices and
on vocal tract abnormalities.

4.2 Female Voice Simulation

The relevant data for female voice simulation are
summarized in Table 4.3. Two options are available:
male-to-female conversion or direct synthesis of female voices.
For the male-to-female conversion, glottal parameters and
tract dimensions for a modal male voice are scaled, using
suitable factors, shown in Table 4.3.
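
The scaling step can be sketched as a table of factors applied to a set of male values. The factor values for the tract and fold dimensions below are the typical values listed in Table 4.3; the fundamental frequency factor (1.6) is an assumed value inside the 1.45 to 1.8 range of that table, and the male parameter values, dictionary keys, and function name are hypothetical. The `enabled` argument loosely mimics the switch contacts of Fig. 4.2.

```python
# Scale factors; "f0" is an assumed value inside the tabulated range.
FACTORS = {
    "f0": 1.6,               # Kf
    "pharynx_length": 0.8,   # kp
    "oral_length": 0.95,     # ko
    "pharynx_area": 0.82,    # kn
    "oral_area": 1.0,        # kr
    "fold_length": 0.8,      # kg
}


def male_to_female(male, factors=FACTORS, enabled=None):
    """Scale the male values whose names are in `enabled` (default: all)."""
    active = set(factors) if enabled is None else set(enabled)
    return {name: value * factors[name]
            if name in active and name in factors else value
            for name, value in male.items()}


# Hypothetical male values (Hz, cm, cm^2), for illustration only.
male = {"f0": 120.0, "pharynx_length": 9.0, "oral_length": 8.0,
        "pharynx_area": 3.0, "oral_area": 4.0, "fold_length": 2.0}

female_all = male_to_female(male)                   # all contacts closed
female_f0 = male_to_female(male, enabled={"f0"})    # one contact closed
print(female_all)
print(female_f0)
```

Closing one contact at a time, as in the second call, allows the perceptual effect of each factor to be assessed separately.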

Fundamental frequency. The second formant frequency and
bandwidth, and the fundamental frequency are considered the
primary factors responsible for the identification of gender
(Childers et al., 1991). The mean male fundamental frequency
spans 120-130 Hz, while the mean female value ranges
from 190 to 220 Hz, according to Nittrouer et al. (1990).
However, researchers have found different ranges for different
data. The pitch depends on the size and mass of the vocal
folds. Vocal fold length ranges from 17 to 24 mm for men and
from 13 to 17 mm for women (Borden and Harris, 1984).

Vocal tract. The male tract is typically 1.15 times
longer than the female tract (Fant, 1973). The difference
between male and female vocal tracts is more pronounced in the
pharynx cavity. The ratio of pharyngeal length to mouth
cavity length is greater for males than for females.
Therefore, the first attempt would be to use different scale
factors (kp, ko, kn, kr in Fig. 4.2) for the pharynx and oral
tract width and length. However, this procedure does not
ensure good results for the first formant (Traunmuller, 1984).
A better alternative is to derive the vocal tract
configurations directly from the female voice formants
(Chap. 3).

Figure 4.2 Scheme for conversion of voices.

TABLE 4.3 COMPARISON BETWEEN MALE AND FEMALE VOICES

1. Fundamental Frequency
   MALE:   F0m; range: 112 to 146 Hz
   FEMALE: F0f = F0m Kf, Kf: 1.45 to 1.8; range: 170 to 275 Hz

2. Spectrum
   For loud female voices the spectral slope is steeper.

3. Formant Frequencies
   MALE:   FmN, N: formant # (1 to 4)
   FEMALE: FfN = FmN KNa > FmN, N: 1, 2, 3, 4; mean KNa: 1.2;
           KNa > 1, vowel-dependent

4. Glottal Parameters
   MALE:   asymmetrical waveform; speed quotient Qmm > 0.5;
           hump in the opening phase; open quotient Qom: 0.6-0.8
   FEMALE: "more symmetrical" waveform; Qmf: 0.5; no "hump";
           Qof > Qom

5. Breathiness
   MALE:   rest area Agminm: near 0; harmonic/noise HNRm
   FEMALE: Agminf > 0; HNRf > HNRm (noise)

6. Vocal Tract Dimensions
   MALE:   total length lvm: 16.5 to 17.5 cm; pharynx length lpm;
           oral length lom; pharynx area Apm; oral area Aom
   FEMALE: lvf = lvm kv < lvm, kv typical: 0.87
           lpf = lpm kp, kp typical: 0.8
           lof = lom ko, ko typical: 0.95
           kp < ko < 1
           Apf = Apm kn, kn typical: 0.82
           Aof = Aom kr, kr typical: 1.0
           kn < kr < 1

7. Vocal Fold Dimensions
   MALE:   thickness dm; length lgm
   FEMALE: df = dm kd, kd < 1; lgf = lgm kg, kg ~= 0.8

Breathiness. A greater amount of breathiness is
generally found in the female voice than in the male voice
(Koike and Hirano, 1973). That is a cue for decreasing the
closed phase, increasing the rest area and also for inserting
turbulent noise at the glottis during the male/female
conversion (Klatt, 1987).

Three measures have been used to evaluate noise in the
speech signal: the noise-to-harmonic ratio NHR (Yumoto et al.,
1982; Lee and Childers, 1989), the normalized noise energy NNE
(Kasuya et al., 1986), and the breathiness index Br (Fukazawa,
1988). The NHR is defined as the ratio of the acoustic energy
of the stable harmonic to that of the noise. Lee and Childers
(1989) used only the ratio of the noise and harmonic
components above 2 kHz to obtain a better predictor of
breathiness, called NHRh. The NNE is defined as the ratio of
the energy of the noise in the speech signal to the total
energy of the speech signal measured from its spectrum.
Kasuya et al. (1986) reported that the NNE is more robust than
the NHR, and that the NNE alone can reliably detect vocal fold
nodules and polyps, recurrent nerve paralysis and advanced
glottic cancer. The Br, defined as the ratio of the energy of
the second derivative of the preemphasized voice signal to the
energy of the preemphasized voice signal times 100 (Fukazawa,
1988), proved to be the simplest and most effective correlate
of breathiness. Frequency and amplitude perturbations (jitter
and shimmer), caused by abnormal vocal fold vibrations,
superimpose noise on the harmonic component of the speech
signal, impairing the use of the NHR and NNE. In this case
only the Br can be used. However, the Br cannot be used with
voices of rapidly varying intensity, since it must be measured
in vowels with constant intensity (Corazzini, 1991). Corazzini
(1991) found NNE ranging from -6.25 to -23.54 and Br from 21.7
to 120.2, for normal voices. The model and data for
simulating breathiness are provided by Lee and Childers (1989)
and Lalwani and Childers (1991), who verified that an
amplitude-modulated noise source can enhance the perception of
naturalness. Stevens (1971) reported that the best placement
of the noise source, within a pitch period, is near the peak
volume velocity.
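
The definition of the breathiness index Br lends itself to a short discrete-time sketch. The version below is an assumption-laden illustration rather than Fukazawa's procedure: the first-order preemphasis coefficient (0.95), the use of a second difference in place of the second derivative, and the synthetic test signals are all hypothetical, and the absolute values depend on the sampling rate, so they are not comparable with the ranges quoted above.

```python
import numpy as np


def breathiness_index(signal, preemphasis=0.95):
    """Br = 100 * energy(second difference of p) / energy(p),
    where p is the preemphasized signal."""
    x = np.asarray(signal, dtype=float)
    # First-order preemphasis (coefficient is an assumption).
    p = x[1:] - preemphasis * x[:-1]
    # Second difference as a discrete stand-in for the second derivative.
    d2 = np.diff(p, n=2)
    return 100.0 * np.sum(d2 ** 2) / np.sum(p ** 2)


# Synthetic comparison: a pure tone versus the same tone plus noise.
fs = 10000                                   # assumed sampling rate in Hz
t = np.arange(2000) / fs
clean = np.sin(2.0 * np.pi * 200.0 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.standard_normal(t.size)

br_clean = breathiness_index(clean)
br_noisy = breathiness_index(noisy)
print(br_clean, br_noisy)
```

Because the second difference emphasizes high frequencies, added broadband noise raises the index, which is the behavior the measure relies on.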

Glottal waveform. The harmonic-amplitude difference
(amplitude of the first harmonic relative to that of the
second) is greater in female voices, that is, the spectral
slope is steeper for female voices (Klatt and Klatt, 1990).
Thus, female voice glottal excitations have longer open
quotients and are "more" symmetrical than those of the male,
that is, the opening and closing parts are approximately equal
(Kitzing and Sonesson, 1974; Cheng and Guerin, 1987; Childers
et al., 1987; Nittrouer et al., 1990). Male volume-velocity
waveforms frequently show a hump in the opening phase, while
female waveforms seldom display it (Ishizaka and Flanagan,
1972).

In the direct synthesis of female voice, the vocal tract
dimensions are derived from the inverse mapping scheme using
target formants for female voice. A "more symmetrical"
glottal excitation (speed quotient near 0.5), with a rest area
slightly greater than zero, and a pitch contour ranging
between 170 and 275 Hz are used.

4.3 Creaky Voice Simulation

Creaky voice (or vocal fry) is included in Fig. 4.1 in
the class of harsh voices. Lee and Childers (1989) conducted
comparative measurements of glottal excitation features for
modal, creaky, falsetto, and breathy voices. The features
were "the instant of maximum closing slope of the glottal
pulse, the glottal pulse width, and the glottal pulse skewness
(ratio of duration of glottal opening phase to duration of
closing phase)." They also measured the "harmonic richness
factor," HRF (ratio of the sum of the intensities of the
harmonics to the intensity of the fundamental frequency), and
the "waveform peak factor," WPF (the decay characteristics of
the waveform during a single pitch period). Results and
distinctive characteristics between modal and creaky
registers are listed in Table 4.4.

TABLE 4.4 CHARACTERISTICS OF CREAKY VOICE

1. PERCEPTION

Rapid series of taps, the sputter of a low-powered outboard
motor; low pitch; harsh.

2. PHYSIOLOGICAL CORRELATES

a. VOCAL FOLD VIBRATION: only at the anterior portion.

b. LAG: large vertical phase difference between
the lower and upper edges of the folds.

c. ADDUCTION: folds tightly pressed together during
collision.

d. FREQUENCY CONTROL: length and thickness of the vocal
folds seem not to vary as the pitch changes (frequency
control done mainly by the subglottal pressure).

3. ACOUSTIC CORRELATES

a. FUNDAMENTAL FREQUENCY: very low (24 to 52 Hz).

b. SPEECH WAVEFORM: high decay between excitation pulses
(about 43 dB); waveform peak factor: WPFfry > WPFmodal.

c. SPECTRUM: more visible formant structure of spectral
lines and higher level in the upper range of frequencies
(compared with modal voice spectrum).

d. EXCITATION WAVEFORM: small open quotient (about
0.15) and large speed quotient (steep falling slope);
harmonic richness factor HRFfry > HRFmodal (about 10 dB);
glottal area: pitch period with 1, 2, or 3 peaks.

These acoustic correlates provide the cues to introduce the
perception of vocal fry into modal speech: a very low
fundamental frequency (24 to 52 Hz for male voice), a
volume-velocity waveform with reduced glottal pulse width
(about 25% of the pitch period), small open quotient and large
speed quotient. Dampening alternate glottal pulses (Coleman,
1963), using a glottal area that displays 2 or 3 peaks, and
reducing the intensity contour (about 20 dB) can enhance the
perception of vocal fry (Childers, 1988). Fig. 4.3 shows some
waveforms that can be
derived  from  our  source  model  in  order  to  simulate  creaky 
voices . 
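As a rough illustration of how the fundamental frequency, open quotient and speed quotient fix the timing of one glottal cycle, the following sketch computes the phase durations and a simple glottal area pulse. It is not the source model of this work. It assumes that the speed quotient is the fraction of the open phase spent opening (so that 0.5 gives a symmetrical pulse), and the raised-cosine pulse shape and the names glottal_pulse_timing and glottal_area are hypothetical.

```python
import math


def glottal_pulse_timing(f0_hz, open_quotient, speed_quotient):
    """Return (period, opening, closing, closed) durations in seconds.

    Assumption: speed_quotient is the fraction of the open phase spent
    opening, so 0.5 corresponds to a symmetrical pulse.
    """
    period = 1.0 / f0_hz
    open_time = open_quotient * period
    opening = speed_quotient * open_time
    closing = open_time - opening
    closed = period - open_time
    return period, opening, closing, closed


def glottal_area(t, f0_hz, open_quotient, speed_quotient,
                 peak_area=1.0, rest_area=0.0):
    """Glottal area at time t (s) for an illustrative raised-cosine pulse."""
    period, opening, closing, _ = glottal_pulse_timing(
        f0_hz, open_quotient, speed_quotient)
    tau = t % period  # position inside the current pitch period
    if tau < opening:
        # Rising (opening) part of the pulse.
        pulse = 0.5 * (1.0 - math.cos(math.pi * tau / opening))
    elif tau < opening + closing:
        # Falling (closing) part of the pulse.
        pulse = 0.5 * (1.0 + math.cos(math.pi * (tau - opening) / closing))
    else:
        # Closed phase: only the rest area remains.
        pulse = 0.0
    return rest_area + peak_area * pulse
```

With a fundamental frequency of 30 Hz, an open quotient of 0.15 and a speed quotient of 0.8, this gives a period of about 33.3 ms, of which about 4 ms is spent opening and 1 ms closing, which corresponds to the steep falling slope listed in Table 4.4.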


4.4 Results and Discussion

Figures  4.4  to  4.12  illustrate  the  results  of  a few 
experiments  on  male-to-female  voice  conversion  and  Fig.  4.13 
shows  the  result  of  a creaky  voice  simulation. 

Figures 4.4 and 4.5 show high-quality synthetic speech
sentences and their corresponding spectrograms for a male and
a female utterance, respectively.

Figures 4.6 to 4.8 show synthetic speech sentences with
the same values for their parameters: open quotient Qo = 0.96,
speed quotient Qm = 0.5, rest area Agmin = 0.025 cm², and
phase shift θ = 55. However, in the experiment shown in Fig.
4.8, insertion of turbulent noise at the glottis and scaling
of the vocal tract and of the vocal folds were used; in the
experiment shown in Fig. 4.7, only turbulent noise was used;
and in the experiment shown in Fig. 4.6, neither scaling nor
turbulent noise was used. The experiment shown in Fig. 4.9
is the same as that shown in Fig. 4.8, except that the average
fundamental frequency (F0av) is now about 250 Hz.

Figure 4.3 Glottal area for vocal fry simulation.

Figures 4.10 and 4.11 show the results of experiments
that use Qo = 0.96, Qm = 0.5, Agmin = 0.025 cm², θ = 0.0 and
F0av = 250 Hz. In the experiment of Fig. 4.11, scaling and
turbulent noise were used, whereas in the experiment of Fig.
4.10, neither scaling nor turbulent noise was used.

Figure 4.12 shows a synthetic speech sentence obtained
with Qo = Qm = 0.6, F0av = 185 Hz, θ = 0.0, Agmin = 0.0, with
scaling and insertion of turbulent noise.

Finally, Fig. 4.13 shows the simulation of a creaky
voice, using an average fundamental frequency of 30 Hz, an open
quotient of 0.15, and a speed quotient of 0.8.

Without comprehensive and exhaustive experimentation,
it is not advisable to draw definitive conclusions.
Nevertheless, some global directions can be derived from our
experiments:

(1) The fundamental frequency and the excitation
waveform seem to be the key factors that affect the voice
conversion process.

(2)  Insertion  of  turbulent  noise  increases  the 
perception  of  breathiness,  but  its  effect  seems  to  be  small  in 
terms  of  discrimination  between  female  and  male  voices. 

(3) The appropriate scaling of the vocal fold and
vocal tract dimensions needs to be investigated further. The
results of scaling seem to take the original voice towards the
pattern of child voices. A suitable alternative seems to be
the derivation of the vocal tract areas from the female voice
formant contour.

(4) Vocal fry can be well simulated without dampening
alternate glottal pulses and also without using glottal areas
of the type sketched in Fig. 4.3.

As noted earlier, our goal is to provide a flexible
tool that can improve on the experiments in voice conversion
and vocal disorders by offering additional parameters and an
efficient realization.

[Figure: waveform panel "SYNTHETIC SPEECH WWW1.SD", sample axis
(Ts = 0.1 ms); spectrogram, frequency axis (Hz).]

Figure 4.4 Synthetic speech sentence "We were away a year
ago," and its wideband spectrogram, for a male
speaker.


[Figure: waveform panel "SYNTHETIC SPEECH WWWFF.SD", sample axis
(Ts = 0.05 ms); spectrogram, frequency axis (Hz) versus time.]

Figure 4.5 Synthetic speech sentence "We were away a year
ago," and its wideband spectrogram, for a
female speaker.

[Figure: waveform panel "SYNTHETIC SPEECH W255F.SD", sample axis
(Ts = 0.1 ms).]

Figure 4.6 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 55, F0av = 185 Hz, using
neither insertion of turbulent noise at the
glottis nor scaling of the vocal tract and
vocal fold dimensions.

Figure 4.7 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 55, F0av = 185 Hz, using
insertion of turbulent noise at the glottis and
no scaling of the vocal tract and vocal fold
dimensions.

Figure 4.8 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 55, F0av = 185 Hz, using
insertion of turbulent noise at the glottis and
scaling of both vocal tract and vocal fold
dimensions.

[Figure: waveform panel "SYNTHETIC SPEECH W255F250.SD".]

Figure 4.9 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 55, using insertion of
turbulent noise at the glottis and scaling of
both vocal tract and vocal fold dimensions, and
F0av = 250 Hz.

Figure 4.10 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 0.0, F0av = 250 Hz, using
neither insertion of turbulent noise at the
glottis nor scaling of the vocal tract and
vocal fold dimensions.

Figure 4.11 Synthetic speech with Qo = 0.96, Qm = 0.5,
Agmin = 0.025 cm², θ = 0.0, F0av = 250 Hz, using
insertion of turbulent noise at the glottis and
scaling of both vocal tract and vocal fold
dimensions.

Figure 4.12 Synthetic speech with Qo = 0.6, Qm = 0.6,
Agmin = 0.0 cm², θ = 0.0, F0av = 185 Hz, using
insertion of turbulent noise at the glottis and
scaling of both vocal tract and vocal fold
dimensions.

Figure 4.13 Simulation of creaky voice with Qo = 0.15,
Qm = 0.8, Agmin = 0.0 cm², θ = 0.0, F0av = 30 Hz,
and without dampening alternate glottal pulses.


CHAPTER  5 

CONCLUSIONS AND SUGGESTIONS

5.1 Introduction

The purpose of this research effort was to develop an
articulatory synthesizer which, besides being a flexible and
robust tool for perception studies, could provide the highest
quality possible at a reasonable computational cost. The
following goals were established: 1) the
implementation of the "Articulatory Model" as an interactive
graphic editor, 2) the development of the "Acoustic Model," 3)
a solution to the "Inverse Mapping" problem, and 4) a
validation of the synthesizer through simple experiments.

5.2 Summary and Discussion

5.2.1  Articulatory  Model  Realization 

The  articulatory  model  has  been  implemented  as  a graphic 
editor,  for  interactive  and  noninteractive  tasks.  It  generates 
the  area  functions  for  the  articulatory  synthesizer  and  can 
also  provide  the  formants  for  the  formant  synthesizer.  It  has 
been based on Mermelstein's model, with a few modifications.

The following features are available:

Implementation for workstations and personal computers.

The vocal tract outline is drawn articulator by
articulator, using "button" macro commands, cross-hair
coordinates and keyboard entries. The model can deal with
both  vowels  and  consonants.  The  system  is  menu-driven  and 
provides  on-line  help.  The  researcher  can  easily  change  the 
position  of  any  articulator.  Default  positions  are  provided 
for  the  velum,  hyoid  and  tongue  tip  and  previous  values  can  be 
selected  for  all  the  parameters.  The  fixed  structure  of  the 
sagittal  profile  can  be  easily  modified  in  order  to  simulate 
both  male  and  female  vocal  tracts. 

The  sagittal  grid  is  oriented  according  to  the  position 
of  the  articulators,  enabling  better  measurement  of  the 
sagittal  distances. 

The  system  provides  different  drawing  and  editing 
tools,  e.g.,  windowing,  zooming,  scaling,  layering,  rotating, 
coloring,  measuring  of  distances  and  angles,  etc. 

The realization displays a three-dimensional representation
of the area function waveform, the corresponding contour of
formants (first to fourth), the position of the constriction
and its cross-sectional area. In the graphics and tables,
each  phoneme  and  corresponding  area  function  and  formant 
points  use  the  same  specific  color. 

These features have made the articulatory model efficient
and flexible. It can interface with the inverse mapping
realization, exporting proper initial configurations for the
optimization procedure or importing codebooks.

5.2.2 Development of the Acoustic Model

A flexible and robust target-based acoustic model,
composed of the glottal source, vocal and nasal tracts, noise
source and radiation, has been realized. A time-domain
implementation has been chosen not only to achieve high-quality
and natural-sounding synthesized speech but also to
improve on the experiments in voice conversion and vocal
disorders.

For the glottal source, we proposed two models based on
the two-mass model that combine good synthetic speech
quality with a low computational burden. The first is the
"parametric 2-mass model," which simplifies the generation of
the two glottal areas. The second is the "equivalent glottal
area model." The models are convenient for simulating
abnormal vocal fold vibration, since they provide more
controls than other models, such as the LPC and formant
synthesizers. The "equivalent glottal area" is
mathematically processed as a one-mass model, but includes the
relevant parameters of the two-mass model. We established the
relationship between the acoustic parameters and the control
parameters of the model. For the derivation of the pitch
contour we have developed an algorithm based on peak picking,
zero-crossing and parabolic interpolation, capable of working
over the entire range of speech.
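The parabolic interpolation step mentioned above can be sketched as follows. This is only a minimal illustration, not the pitch algorithm of this work: it fits a parabola through a picked peak and its two neighbors to locate the peak between samples, and then estimates the fundamental frequency from the mean spacing of the refined peaks. The zero-crossing stage is omitted, and the function names and the fixed threshold are assumptions.

```python
import math


def parabolic_peak(y, i):
    """Refine a local maximum at index i with a parabola through 3 samples.

    Returns the interpolated (fractional) peak position and peak value.
    """
    a, b, c = y[i - 1], y[i], y[i + 1]
    denom = a - 2.0 * b + c
    if denom == 0.0:
        return float(i), b  # flat neighborhood: keep the sample itself
    delta = 0.5 * (a - c) / denom  # vertex offset, in samples
    return i + delta, b - 0.25 * (a - c) * delta


def estimate_f0(y, fs, threshold):
    """Estimate F0 (Hz) from refined positions of peaks above a threshold."""
    peaks = [parabolic_peak(y, i)[0]
             for i in range(1, len(y) - 1)
             if y[i] > threshold and y[i - 1] < y[i] >= y[i + 1]]
    if len(peaks) < 2:
        return None  # not enough peaks to measure a period
    mean_period = (peaks[-1] - peaks[0]) / (len(peaks) - 1)
    return fs / mean_period
```

The refinement matters mostly at high pitch and low sampling rates, where one sample is a noticeable fraction of the pitch period.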

A time-domain approach inspired by Maeda's work (1982)
has been chosen for the vocal and nasal tracts. The
trapezoidal algorithm has been adopted in order to solve the
stiff differential equations, assuring numerical stability
regardless of the step size. Source-tract interaction is
provided. The radiation model is that of Flanagan (1972a).
Noise sources are used for simulating fricatives, stops and
aspirated sounds. The maxillary sinuses can be coupled to the
nasal tract. The glottal length and thickness, velopharyngeal
port area, sinuses, critical Reynolds number, and the yielding
wall physical values can be modified via "inquire" commands,
to facilitate experiments. The trade-off between quality and
computational time is determined by the number of targets, the
interpolation scheme, the number of sections and the
sampling frequency. The warping of frequencies due to the
continuous-to-discrete conversion is kept low by using 30
vocal-tract sections and a sampling frequency of 30 kHz.
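The stability property of the trapezoidal algorithm can be illustrated on the scalar test equation y' = -lam * y, a standard stand-in for one fast-decaying mode of a stiff system. The sketch below compares the per-step amplification factors of the forward Euler and trapezoidal rules. The decay rate used in the usage note is an arbitrary illustrative value, not a parameter of the synthesizer.

```python
def forward_euler_factor(lam, h):
    """Per-step gain of forward Euler applied to y' = -lam * y."""
    return 1.0 - h * lam


def trapezoidal_factor(lam, h):
    """Per-step gain of the trapezoidal rule applied to y' = -lam * y."""
    return (1.0 - 0.5 * h * lam) / (1.0 + 0.5 * h * lam)


def integrate(factor, y0, n_steps):
    """Apply a constant per-step gain n_steps times, starting from y0."""
    y = y0
    for _ in range(n_steps):
        y *= factor
    return y
```

For example, with a step of 1/30000 s and lam = 1e6 1/s, the forward Euler gain has magnitude of about 32 and the solution diverges, whereas the trapezoidal gain has magnitude of about 0.89 and the solution decays. For any positive lam and any step size, the magnitude of the trapezoidal gain stays below one.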

5.2.3 Acoustic-to-Articulatory Transformation

The  determination  of  proper  vocal  tract  configurations 
from  the  speech  signal  is  difficult  due  to  nonuniqueness  of 
the  inverse  mapping.  The  drawbacks  of  most  methods  are  the 
high  computational  burden,  the  complexity  of  the  measurement 
apparatus  and  the  lack  of  robustness  for  achieving  tracking  of 
the  articulator  positions. 

Since this problem is important for the development of
articulatory synthesizers, we sought to contribute to its
solution.

We have adopted a new optimization scheme that
concatenates a gradient search and linear successive
approximation, providing fast convergence, very small errors,
and natural articulatory dynamics. The objective function is
the least-absolute-value (l1-norm) error between
the model-derived and the speech-derived first three formants.
The gradient search is accelerated by using an algorithm
inspired by the Fletcher-Reeves Method. The optimal step size
is determined by interpolation. Constraints are imposed on
the articulatory parameters to avoid physiologically
impossible configurations. In the first step of the algorithm,
the placement of the articulator responsible for the
constriction is optimized (tongue body for vowels or tongue
tip for consonants). Then, a multidimensional optimization is
applied to the lip height and protrusion, jaw angle, hyoid
shift, and tongue-tip, tongue-body and velum coordinates. In
the next phase of the algorithm, the linear successive
approximation method combines sets of 3 articulatory
parameters to assure convergence, mainly in the region
near the optimum articulatory vector. Local minima that could
trap the gradient search are circumvented with this procedure,
and, typically, errors of less than 2% can be achieved. The lower
pharynx width and length, considered fixed in most
articulatory models, can be adjusted in two cases: 1) if a
significant decrease in the error can be obtained and 2) to
simulate female tracts. The error can be further reduced by
applying a cascade of one-dimensional optimization procedures.
Graphical aids are available. The program is driven by
"button" macro commands. The updated vocal tract
configuration, area function, articulatory and acoustic
vectors, and error are displayed on the screen during all
phases.
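Two ingredients of the scheme above can be sketched separately: the least-absolute-value formant error and the Fletcher-Reeves update of the search direction. The sketch is illustrative only. The normalization of the error by the target formants is an assumption, and since the area-to-formant model is not reproduced here, the direction update is exercised on a small quadratic function with an exact step size rather than on the formant error with the interpolated step size used in this work. All names are hypothetical.

```python
def formant_error(model_formants, target_formants):
    """Sum of relative absolute errors over the formants (l1-type error)."""
    return sum(abs(m - t) / t
               for m, t in zip(model_formants, target_formants))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fletcher_reeves_direction(g_new, g_old, d_old):
    """New search direction: -g_new + beta * d_old, beta = |g_new|^2/|g_old|^2."""
    beta = dot(g_new, g_new) / dot(g_old, g_old)
    return [-gn + beta * d for gn, d in zip(g_new, d_old)]


def minimize_quadratic(A, b, x0, n_iter, tol=1e-24):
    """Minimize 0.5*x'Ax - b'x (A symmetric positive definite).

    Uses Fletcher-Reeves directions with an exact line search, which is
    only possible because the stand-in objective is quadratic.
    """
    def matvec(M, v):
        return [dot(row, v) for row in M]

    x = list(x0)
    g = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # gradient A x - b
    d = [-gi for gi in g]                             # first step: steepest descent
    for _ in range(n_iter):
        if dot(g, g) < tol:
            break
        alpha = -dot(g, d) / dot(d, matvec(A, d))     # exact step along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        if dot(g_new, g_new) < tol:
            break
        d = fletcher_reeves_direction(g_new, g, d)
        g = g_new
    return x
```

On a two-parameter quadratic the method reaches the minimum in two steps, whereas plain steepest descent would zigzag. This is the kind of acceleration that motivates the use of conjugate directions in the gradient phase.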

Proper  articulatory  dynamics  are  achieved  by  considering 
the  vocal  tract  losses  in  the  area-to-formant  forward 
transformation,  by  establishing  good  initial  configurations, 
by  properly  selecting  the  parameters  for  the  optimization 
procedure,  by  imposing  constraints  on  the  relative  placement 
of  articulators,  and  by  using  flexible  on-line  pictorial  aids. 
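The linear successive approximation phase described earlier in this section can be pictured as a repeated linearization: the sensitivities of the three formants to a set of three articulatory parameters are estimated by small increments, and the parameter increments are obtained from a linear system solved through the singular value decomposition. The sketch below follows that outline under several assumptions. The function toy_formants is a hypothetical stand-in for the area-to-formant model, and NumPy's least-squares solver (which is SVD-based) takes the place of the dedicated solver.

```python
import numpy as np


def toy_formants(p):
    """Hypothetical stand-in for the area-to-formant model (values in Hz)."""
    p0, p1, p2 = p
    return np.array([
        500.0 + 100.0 * p0 - 40.0 * p1 + 5.0 * p0 ** 2,
        1500.0 - 60.0 * p0 + 200.0 * p1 + 30.0 * p2,
        2500.0 + 20.0 * p1 + 150.0 * p2,
    ])


def sensitivities(model, p, eps=1e-4):
    """Finite-difference sensitivities d(formant i) / d(parameter j)."""
    base = model(p)
    jac = np.empty((base.size, p.size))
    for j in range(p.size):
        q = p.copy()
        q[j] += eps  # small increment of one parameter
        jac[:, j] = (model(q) - base) / eps
    return jac


def successive_approximation(model, target, p0, n_iter=5):
    """Repeatedly solve jac * dp = target - model(p) and update p."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(n_iter):
        residual = target - model(p)
        jac = sensitivities(model, p)
        dp, *_ = np.linalg.lstsq(jac, residual, rcond=None)  # SVD-based solve
        p = p + dp
    return p
```

Because each step relies on a local linearization, it is most effective near the optimum, which is consistent with applying it after the gradient search.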

5.2.4  Validation  and  Experiments 

The  articulatory  synthesizer  has  been  validated  by 
conducting  some  simple  experiments  on  male/female  conversion 
and  on  pathological  voice  simulations.  Our  purpose  has  been 
to  make  it  evident  that  the  articulatory  synthesizer  can  deal 
with  some  features  that  are  not  available  in  the  formant  and 
LPC  synthesizers:  1)  improper  articulation,  tumors,  cleft 
palate,  and  obstructions  in  the  vocal  tract,  2)  realistic 
length  and  cross-sectional  area  for  female  and  children's 
voices,  and  3)  dimensions  and  weights  of  the  vocal  folds. 

The  parameters  were  first  modified  individually  and  then 
in  subsets. 

Another important point in the process of validation is
that our synthesizer can optionally be driven directly by a
volume-velocity excitation, instead of using a glottal circuit
controlled by the glottal area. Consequently, two additional
experiments can be conducted: 1) to drive the articulatory
synthesizer with the well-established formant synthesizer
glottal source, in order to replicate the pathological voice
simulation, and 2) to drive the articulatory synthesizer with
the volume-velocity derived by inverse filtering, to verify
quality and naturalness.

Another option enabled by our program is to use the
three-dimensional vocal fold model (Hu and Childers, 1991) to
assess the effect of vocal fold nodules and polyps.

5.3 Suggestions for Future Research

Various issues related to articulatory synthesizers
require more investigation. Our suggestions for the next
phases are: 1) exhaustive experimentation, 2) assessment of
losses, 3) inverse mapping for nasals, 4) interpolation of area
functions, 5) implementation of the tracts, radiation, and
noise source using wave digital filters, 6) optimization
scheme for the glottal source, 7) artificial neural network
for the codebooks, 8) text-to-speech synthesis, and 9)
speaker-independent speech recognition.

5.3.1  Exhaustive  Experimentation 

Exhaustive and comprehensive experimentation should
encompass:

Effect of varying the number of vocal tract sections on
the quality of the speech (Wakita and Fant, 1978; Maeda,
1982);

Effect of varying the sampling frequency (Maeda, 1982);

Effect of varying the glottal parameters and of
inserting turbulent noise on the simulation of modal, creaky,
breathy, rough and hoarse voices (Lalwani and Childers, 1991);

Effect of vocal fold nodules, polyps, and cord thickening
on the produced speech (Hu and Childers, 1991);

Effect of varying the placement of the noise source for
fricatives, stops and aspirated sounds (Flanagan et al., 1975;
Shadle, 1985; Sondhi and Schroeter, 1987).

5.3.2 Assessment of Losses

The issues are:

Effect  of  the  subglottal  load  on  the  quality  (Allen  and 
Strong,  1985,  p.  68;  Klatt,  1987,  p.  747); 

Effect  of  the  radiation  through  walls; 

Effect  of  neglecting  viscous  friction  loss; 

Effect  of  neglecting  heat  conduction  loss; 

Simulation of the yielding wall losses (distributed,
lumped, neglected, without the capacitance), their effect on
speech (Ishizaka et al., 1975; Fant et al., 1976; Lin, 1990),
and their effect on the glottal waveforms in the case of stop
consonants (Bocchieri, 1983, p. 105);

Lumping  all  the  losses  in  a few  locations  of  the  tract. 




5.3.3  Inverse  Mapping  of  Nasals 

The point is to use an area-to-formant forward
transformation that considers the nasal tract. Algorithms are
provided by Fant (1985), Maeda (1982), Flanagan (1972a), Atal
et al. (1978), etc. The effect of varying the velopharyngeal
port and of considering the sinuses could then be evaluated
for the new values (Maeda, 1982; Childers and Ding, 1991).
Other experiments that could be conducted are:

Denasalization: simulation of voices of subjects with
nasal congestion ("clogged nose"), by increasing the nasal
tract viscous friction or closing the velopharyngeal port
during the phonation of nasals.

Cleft palate and nasalization: effect of improper
opening of the velopharyngeal port area on the quality and
loudness of the speech (Bocchieri, 1983; Nord and Ericsson,
1985).

5.3.4  Interpolation  of  Areas  and  Parameters 

We have used a linear interpolation for the area function
and glottal parameters in between the targets. Some
articulators, such as the tongue tip and lips, can move more
rapidly than others, such as the tongue body or the hyoid, in
certain transitions (Borden and Harris, 1984). Thus, a new
interpolation scheme may result in an improvement of the
quality and naturalness of speech, without an excessive
computational burden, since the number of targets can be
reduced. An approach similar to that used by Pinto et al.
(1989) for formant tracking could be considered. Example
interpolation functions that can smooth the transitions
between variables are given by Parthasarathy and Coker (1990),
and Gupta and Schroeter.
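The difference between the linear interpolation used here and a smoother alternative can be sketched as follows. The raised-cosine weight is only one possible smoothing function, chosen for illustration. It is not taken from the cited works, and the function names are hypothetical.

```python
import math


def linear_weight(u):
    """Linear transition weight for u in [0, 1]."""
    return u


def cosine_weight(u):
    """Raised-cosine weight: zero slope at both targets (illustrative choice)."""
    return 0.5 * (1.0 - math.cos(math.pi * u))


def interpolate_targets(start, end, n_frames, weight=linear_weight):
    """Return n_frames + 1 frames going from start to end (both included).

    start and end are lists of equal length, e.g. section areas in cm^2.
    """
    frames = []
    for k in range(n_frames + 1):
        w = weight(k / n_frames)
        frames.append([(1.0 - w) * a + w * b for a, b in zip(start, end)])
    return frames
```

Both weights reach the targets exactly, but the raised-cosine weight leaves and approaches them with zero slope, which avoids the abrupt change of velocity that linear interpolation produces at each target.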

5.3.5  Wave  Digital  Filter  Implementation 

The speed of synthesis can be improved
by using wave digital filters to model the glottal source, the
tracts, the noise source and the radiation. The first step
could be a digital realization that neglects losses and
source-tract interaction and uses a simplified glottal model, in
order to achieve real-time synthesis (Meyer et al., 1989).
The second step could be the inclusion of losses and the use
of our excitation models (Kabasawa et al., 1983).

5.3.6  Optimization  Scheme  for  the  Source 

The knowledge gained from our approach for deriving vocal
tract area functions may benefit the implementation of a
scheme for optimizing glottal parameters. The squared
difference between the logarithms of the squared magnitudes of
the synthesized and the original spectra, summed over all
frequencies, has been used as the objective function by some
researchers (Flanagan et al., 1975, 1980; Levinson and
Schmidt, 1983; Gupta and Schroeter, 1991).
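The objective function described above can be written compactly. The sketch below takes two magnitude spectra sampled at the same frequencies and sums the squared differences of their log power values. The natural logarithm and the small floor that guards against the logarithm of zero are assumptions, and the function name is hypothetical.

```python
import math


def log_spectral_error(synth_mag, orig_mag, floor=1e-12):
    """Sum over frequency bins of (log|S|^2 - log|O|^2)^2."""
    total = 0.0
    for s, o in zip(synth_mag, orig_mag):
        # The floor avoids log(0) for empty bins.
        diff = math.log(max(s * s, floor)) - math.log(max(o * o, floor))
        total += diff * diff
    return total
```

Because the comparison is made on a logarithmic scale, a given ratio between the two spectra contributes equally whether it occurs at a strong or a weak spectral component.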

5.3.7  Neural  Networks  or  Dynamic  Programming 

After a comprehensive database is established, a
codebook-lookup method could be implemented, based on dynamic
programming or on artificial neural networks (Xue et al.,
1990; Kobayashi et al., 1991; Rahim et al., 1991).
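One way in which dynamic programming could be combined with a codebook is sketched below. For every frame, each codebook entry receives an acoustic cost (relative formant error), and a transition cost penalizes articulatory movement between consecutive frames; the path of minimum total cost is then traced back. The cost definitions, the weight, and the codebook layout (pairs of formant and articulatory tuples) are assumptions made for illustration and are not a design taken from the cited works.

```python
def dp_codebook_lookup(codebook, frames, smooth_weight=1.0):
    """Choose one codebook entry per frame by dynamic programming.

    codebook: list of (formants, articulatory_vector) pairs.
    frames:   list of target formant tuples, one per frame.
    Returns the list of selected codebook indices.
    """
    def acoustic(entry, target):
        return sum(abs(f - t) / t for f, t in zip(entry[0], target))

    def movement(a, b):
        return sum(abs(x - y) for x, y in zip(a[1], b[1]))

    n = len(codebook)
    cost = [acoustic(e, frames[0]) for e in codebook]
    back = []  # back[k][j]: best predecessor of entry j at frame k + 1
    for target in frames[1:]:
        new_cost, pointers = [], []
        for j in range(n):
            def total(i, j=j):
                return cost[i] + smooth_weight * movement(codebook[i],
                                                          codebook[j])
            best_i = min(range(n), key=total)
            pointers.append(best_i)
            new_cost.append(total(best_i) + acoustic(codebook[j], target))
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path back from the last frame.
    j = min(range(n), key=lambda i: cost[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

When two entries have the same formants but different articulatory vectors, the transition cost selects the one closest to the preceding configuration, which is how such a search can cope with the nonuniqueness of the inverse mapping.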

5.3.8  Rules  for  Text-to-Speech  Synthesis 

We suggest that the investigation concentrate on sentence
prosody. The first step would be to gain knowledge about the
set of existing phonological rules: intensity pattern,
duration pattern, and fundamental frequency pattern for
allophones (Klatt, 1987; Scully, 1987). The second step would
be to develop additional rules or improvements.

5.3.9  Speaker-Independent  Speech  Recognition 

The difficulties in speaker-independent speech
recognition are related to coarticulation compensation and
speaker adaptation. It is expected that these difficulties
can be overcome by using reliable articulatory movements to
label and classify the phonemes (Shirai and Kobayashi, 1985;
Kobayashi et al., 1991).


APPENDIX  A 

INVERSE  MAPPING  SUBROUTINES 

ANUL.LSP : nullifies some variables and functions.

AR1.LSP : finds the area in the lower pharyngeal region.

AR2.LSP : finds the area in the soft-palate region.

AR23.LSP : finds the area in the hard-palate region.

AR4.LSP : finds the area in the region between the alveolar
ridge and incisors.

AR5.LSP : finds the area in the labial region.

AR9.LSP : finds the area in the upper pharyngeal region.

ARP.LSP : determines equally-spaced points on an arc
(extremities of sagittal lines on the vocal tract wall).

ARTB.F : maps area function into formants (F1 to F3).

ART4.LSP : maps area function into formants (F1 to F4).

ARTN.F : maps vocal-tract area function into formants,
finds the equal-length-section area function, and writes it
into the file "areafun.mat" (codebook).

ARTPAR.LSP : writes the articulatory parameters, the
dimensions of the fixed structure, the formant values and the
corresponding error into the file "artpar.dat" (codebook).

CINT.LSP : determines the intersection of a straight line
(sagittal) and an arc (vocal tract wall).

DEMO.LSP : modifies the size and rotates the text "MMIRC:
ARTICULATORY SYNTHESIZER."

DIST.LSP : writes the distances between the midpoints of two
consecutive sagittal lines into the file "dist.mat."

DRAR.LSP : draws the vocal tract area function.

DRAWAL.LSP : draws the whole vocal tract outline.

DRAWFI.LSP : draws the lower pharynx ("fixed structure").

EIM.LSP : generates the coordinates of the error, for plotting
purposes.

ERG5.LSP : generates the error between the set of three
formants from the formant contour (target) and that derived by
the model, after a small increment in the parameters (during
the gradient process). It calls the "ima5.lsp" and "artb.f"
subroutines.

ERI.LSP : generates the error between the set of three
formants from the formant contour and that derived from the
model.

ERIP.LSP : generates the error between the set of three
formants from the formant contour and that derived from the
model.

ERI4.LSP : generates the error between the set of four
formants from the formant contour and that derived from the
model.

F.LSP : generates the coordinates of the formant points, for
plotting purposes (macro "formant").

FIM.LSP : generates the coordinates of the formant points, for
plotting purposes (macro "plotinit").

FORMAN.LSP : writes the area function into the file
"areart.mat."

GRAD10.LSP : performs the gradient algorithm, using a
ten-parameter articulatory vector (for comparison purposes).

GRADFR.LSP : performs a modified version of the
Fletcher-Reeves gradient, using an 8-parameter articulatory vector.

GRADHFI.LSP : performs the gradient algorithm, using a
six-parameter articulatory vector (lower pharynx and hyoid).

GRADLJT.LSP : performs the gradient algorithm, using a
five-parameter articulatory vector (jaw, tongue tip and lips).

GRADTB.LSP : performs the gradient algorithm only for the
tongue body.

HYY.LSP : determines the center of the arc that represents the
lower-anterior wall of the vocal tract (H-PP in Fig. 2.1).

IM5.MNU : menu for the articulatory model.

IMA4.LSP : determines the outline of the vocal tract, the
sagittal distances and the cross-sectional areas.

IMA5.LSP : determines the outline of the vocal tract, the
sagittal distances and the cross-sectional areas. No "load"
or "nil" operations are used in order to decrease the loop
time.

IMALOA.LSP : loads, before the loops of the gradient,
all the subroutines used in "ima5.lsp."

IMANIL : nullifies all the subroutines and functions used in
"ima5.lsp."

LI.LSP : determines the default value for the tongue tip
(macro "Pert: Vow").

LINSVD.C : solves a linear system of equations using the
singular value decomposition algorithm. The unknowns are the
increments of the parameters, in the successive approximation
method.

MARC.LSP : determines the centers of the arcs B-T and T-PF.

MD.LSP : determines the midpoint of a segment of line and the
distance between the endpoints.

MID.LSP : determines the midpoint of a segment of line.

PHV.LSP : finds the area function, formants and corresponding
error for the incremented values of the hyoid and velum
parameters, in the successive approximation method.

PJT.LSP : finds the area function, formants and corresponding
error for the incremented values of the jaw and tongue body
parameters, in the successive approximation method.

PTL.LSP : finds the area functions, formants and corresponding
errors for the incremented values of the tongue tip and lip
(height) parameters, in the successive approximation method.

PGC.LSP : determines the sagittal lines in the region from the
maximum point in the maxilla, M, to the incisors U (except for
the sagittal line that contains the point N).

PGN.LSP : determines the sagittal line that contains the point
N, providing to the macro [IM: Init] the flag that controls
the change in the color of the outline (pol).

PL0SE4.LSP  : prompts  for  the  name  of  the  parameter  and  for  its 
increment,  draws  the  vocal  tract  outlines  for  the  successive 
incremented  values  of  this  parameter,  finds  the  corresponding 
area  functions,  and  plots  the  formants  and  errors. 

RVEC4.LSP  : reads  in  the  articulatory  parameters. 

RVEC15.LSP  : reads  the  data  from  the  file  "artpar.dat" 
(retrieval  from  the  codebook) . 

SAGR.LSP  : determines  the  sagittal  distances  of  the  vocal 
tract  (interactive  model) . 

SHV.LSP  : determines  the  sensitivities  and  increments  for  the 
hyoid  and  velum  coordinates,  in  the  successive  approximation 
method. 

SJT.LSP  : determines  the  sensitivities  and  the  increments  for 
the  jaw  and  tongue  body  coordinates,  in  the  successive 
approximation  method. 

STL.LSP  : determines  the  sensitivities  and  increments  for  the 
tongue  tip  and  lip  (height)  coordinates,  in  the  successive 
approximation  method. 

UNF.LSP : determines the endpoints of the sagittal lines in
the fixed structure (hard palate), and the sagittal lines and
areas of the lower pharynx (Kl-H in Fig. 2.1), for a female
vocal tract. The dimensions of the lower pharynx are included.

UNM.LSP : determines the endpoints of the sagittal lines in
the fixed structure (hard palate), and the sagittal lines and
areas of the lower pharynx (Kl-H in Fig. 2.1), for a male
vocal tract.

UNMSE.LSP : determines the endpoints of the sagittal lines in
the fixed structure (hard palate), and the sagittal lines and
areas of the lower pharynx (Kl-H in Fig. 2.1). The dimensions
of the lower pharynx are not included in order to enable the
process of optimization of the fixed structure.

VAREA.LSP : determines the area function for the interactive
model.

VTAREA.M  : overlaps  the  plots  of  the  area  function  and  of  the 
derived  equal-length-section  area  function. 

WACV.LSP : writes the formant targets into the file "acvec.in"
(retrieved from the codebook).

W.LSP : generates the coordinates of the area function points,
for plotting purposes (tridimensional representation).

W1234.LSP : writes the formant values onto the screen.

WFIX.LSP : writes the dimensions of the fixed structure into
the file "fixpar.dat."

WP.LSP : generates the coordinates of the area function
points, for plotting purposes (bidimensional representation).

WSVEC4.LSP : writes the articulatory parameters onto the
screen.

WVEC4.LSP : writes the articulatory parameters into the file
"arvec4.in."

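The sensitivity step carried out by SHV.LSP, SJT.LSP and STL.LSP can be illustrated with a short Python sketch. It is not the original LISP code. It assumes that the sensitivities are forward-difference estimates obtained by incrementing one articulatory parameter at a time; the names sensitivity_matrix and toy_formants, the step size, and the linear toy model standing in for the area-to-formant transformation are all hypothetical.

```python
from typing import Callable, List, Sequence


def sensitivity_matrix(
    formants: Callable[[Sequence[float]], List[float]],
    params: Sequence[float],
    step: float,
) -> List[List[float]]:
    """Forward-difference sensitivities d(formant_i) / d(param_j).

    Rows correspond to formants, columns to articulatory parameters.
    """
    base = formants(params)
    columns = []
    for j in range(len(params)):
        bumped = list(params)
        bumped[j] += step  # increment one parameter, keep the others fixed
        shifted = formants(bumped)
        columns.append([(s - b) / step for s, b in zip(shifted, base)])
    # Transpose the per-parameter columns into formant rows.
    return [list(row) for row in zip(*columns)]


def toy_formants(p: Sequence[float]) -> List[float]:
    """Hypothetical linear stand-in for the area-to-formant transformation."""
    return [
        500.0 + 40.0 * p[0] - 10.0 * p[1],
        1500.0 - 25.0 * p[0] + 60.0 * p[1],
        2500.0 + 5.0 * p[0] + 15.0 * p[1],
    ]


if __name__ == "__main__":
    for row in sensitivity_matrix(toy_formants, [1.0, 2.0], 0.5):
        print(row)
```

For a linear model such as toy_formants, the forward differences recover the model coefficients up to rounding; for the actual vocal tract model the result would depend on the chosen increment.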

APPENDIX  B 
DATA  FILES 


ACVEC.IN : acoustical vector: first, second and third
formants from the formant contour (targets).

AREART.MAT : area function (58 cross-sectional areas).

ARTPAR.DAT : initial storage for 10 articulatory parameters,
9 dimensions of the fixed structure, formant targets F1F, F2F,
F3F, model-derived formants F1, F2, F3 and Error. The
filename is renamed for each phoneme in the codebook.

ARVEC4.IN : articulatory vector (10 parameters).

DIST.MAT : distances between the successive midpoints of the
sagittal distances (58).

FCONT.DAT : formant contour from the Formant Synthesizer.

FIXPAR.DAT : dimensions of the fixed structure (4).

FORMAN.OUT : values of F1, F2, F3 and F4, generated by
"artb.f" and "artn.f" (F1 to F3) or "art4.f" (F1 to F4).

PERT.DAT : increments of the articulatory parameters, obtained
from "linsvd.c," for the successive approximation method.

SENS.IN : rows of the matrix of sensitivities, and the shifts
of the formant frequencies. This file is the input to "linsvd.c"
in the successive approximation method.
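
The relation between SENS.IN and PERT.DAT can be illustrated with a minimal Python sketch. It assumes that SENS.IN holds a 3-by-2 sensitivity matrix S (three formants, one pair of articulatory parameters) together with the desired formant shifts df, and that the increments dp written to PERT.DAT are the least-squares solution of S dp = df. The name "linsvd.c" suggests a singular value decomposition; the sketch below instead solves the 2-by-2 normal equations directly, which gives the same solution when S has full column rank. The function name solve_increments and the numerical values are hypothetical.

```python
from typing import List, Sequence


def solve_increments(S: Sequence[Sequence[float]], df: Sequence[float]) -> List[float]:
    """Least-squares solution of S * dp = df for a two-column matrix S.

    Uses the normal equations (S^T S) dp = S^T df, solved in closed form.
    """
    a11 = sum(r[0] * r[0] for r in S)
    a12 = sum(r[0] * r[1] for r in S)
    a22 = sum(r[1] * r[1] for r in S)
    b1 = sum(r[0] * d for r, d in zip(S, df))
    b2 = sum(r[1] * d for r, d in zip(S, df))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Columns are (nearly) dependent; an SVD-based solver would be needed.
        raise ValueError("sensitivity matrix is rank deficient")
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


if __name__ == "__main__":
    # Hypothetical sensitivities (Hz per unit parameter) and formant shifts (Hz).
    S = [[40.0, -10.0], [-25.0, 60.0], [5.0, 15.0]]
    df = [22.5, -27.5, -1.25]
    print(solve_increments(S, df))
```

The normal equations are adequate only for a well-conditioned S; a decomposition-based solver is the more robust choice when two parameters have nearly proportional effects on the formants.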